Background: This report explores the application of clustering analysis to identify distinct groups within a dataset related to diabetes risk factors. The dataset includes variables such as age, glucose levels, insulin levels, BMI, blood pressure, and genetic predisposition factors.
Objective: The objective was to categorize individuals into clusters based on their health profiles and assess their respective risks of developing diabetes. By understanding these clusters, targeted interventions could be developed to mitigate diabetes risks effectively.
Results: Four distinct clusters were identified:
- Cluster 1: Young adults with low diabetes risk.
- Cluster 2: Middle-aged individuals with high diabetes risk.
- Cluster 3: Young adults with moderate diabetes risk.
- Cluster 4: Older adults with moderate diabetes risk.

Each cluster exhibited unique characteristics in terms of age, glucose levels, insulin levels, BMI, blood pressure, and genetic predisposition. Strategic interventions were recommended for each cluster to optimize diabetes prevention and management efforts.
Conclusion: By tailoring interventions to the specific needs of each cluster, healthcare providers can enhance the effectiveness of diabetes prevention strategies. This targeted approach not only improves health outcomes but also contributes to reducing the overall burden of diabetes in the population.
In this section, an exploratory data analysis (EDA) was conducted on the diabetes dataset. The primary objective was to understand the distribution of each variable, identify missing values, and explore potential relationships between features. This analysis served as a foundation for subsequent modeling and predictive analysis.
The dataset was loaded and the first few rows were displayed to get an initial glimpse of the data structure. A summary of the dataset was generated to gain insights into the central tendency and dispersion of each feature. Missing values were checked in the dataset, as they can significantly impact the analysis and modeling. The number of missing values in each column was calculated and displayed.
## Rows: 768 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...
## # A tibble: 6 × 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## # ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <dbl>
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
# Check for missing values
missing_values <- sapply(data, function(x) sum(is.na(x)))
print(missing_values)
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
Histograms were constructed for each feature to visualise their distributions. These visualisations highlight how data points are distributed across different ranges, offering insights into the prevalence and spread of each variable.
Histograms
# Load plotting packages
library(ggplot2)
library(gridExtra)
# Helper function to create percentage histograms
create_histogram <- function(data, column, title, binwidth, fill_color) {
  # .data[[column]] replaces the deprecated aes_string()
  ggplot(data, aes(x = .data[[column]])) +
    # after_stat(count) replaces the deprecated ..count.. notation
    geom_histogram(aes(y = 100 * after_stat(count) / sum(after_stat(count))),
                   binwidth = binwidth, colour = "black", fill = fill_color) +
    ggtitle(title) +
    xlab(column) +
    ylab("Percentage") +
    theme_minimal() +
    theme(plot.title = element_text(size = 14, face = "bold"),
          axis.title = element_text(size = 12),
          axis.text = element_text(size = 8))
}
# Create histograms for each feature with new colors
p1 <- create_histogram(data, "Pregnancies", "Number of Pregnancies", 1, "#1f77b4") # Blue
p2 <- create_histogram(data, "Glucose", "Glucose", 5, "#ff7f0e") # Orange
p3 <- create_histogram(data, "BloodPressure", "Blood Pressure", 2, "#2ca02c") # Green
p4 <- create_histogram(data, "SkinThickness", "Skin Thickness", 2, "#d62728") # Red
p5 <- create_histogram(data, "Insulin", "Insulin", 20, "#9467bd") # Purple
p6 <- create_histogram(data, "BMI", "Body Mass Index", 1, "#8c564b") # Brown
p7 <- create_histogram(data, "DiabetesPedigreeFunction", "Diabetes Pedigree Function", 0.05, "#e377c2") # Pink
p8 <- create_histogram(data, "Age", "Age", 1, "#7f7f7f") # Gray
# Arrange plots in a two-column grid layout
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2)
### Histogram Analysis
The histogram reveals that most individuals in the dataset have 0 to 5 pregnancies (the median is 3). A smaller proportion have 10 or more, up to the maximum of 17.
The distribution of glucose levels appears approximately normal, with most values falling between 80 and 150 (median 117). This range is a useful reference point for understanding metabolic health in the dataset.
The histogram for blood pressure shows readings clustered around 70 to 80 (median 72). The mean (69.11) sits below the median, which points to a small number of very low values, including implausible zeros, rather than a long right tail.
Skin thickness distribution is also right-skewed, with most individuals having thickness measurements between 20 and 40. This metric is essential in assessing overall health and potential metabolic conditions.
The insulin distribution is strongly right-skewed. Nearly half of the recorded values are zero (median 30.5, mean 79.8), the non-zero values concentrate around 100, and a long tail reaches a maximum of 846.
The histogram for the diabetes pedigree function reveals a right-skewed distribution, with the majority of individuals having function values less than 1. This metric provides insights into the genetic predisposition to diabetes within the study population.
BMI distribution appears roughly normal, with most values between 20 and 40. The median of 32.0 lies above the conventional healthy range (18.5 to 24.9), so overweight and obesity are common in this dataset.
The age distribution is right-skewed, indicating that a significant number of individuals are younger, with ages clustering around 20 to 40. Understanding age demographics is crucial for analysing health outcomes across different age groups.
Overall, the histograms summarise the distribution and central tendency of the key health metrics in the diabetes dataset. These findings serve as a foundation for the further exploration and modeling that follow.
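The skewness described above can also be checked numerically rather than by eye. The following is a minimal base R sketch, not part of the original analysis: `sample_skewness` is a hypothetical helper that computes the moment-based skewness coefficient, where positive values indicate a longer right tail. The example vector holds only the six Insulin values printed by `head()` earlier, so it illustrates the calculation and not the full column.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a longer right tail, negative values a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Insulin values from the six rows printed by head() (illustration only)
insulin_head <- c(0, 0, 0, 94, 168, 0)
skew_insulin <- sample_skewness(insulin_head)
print(skew_insulin)

# A symmetric vector has a skewness of zero
skew_symmetric <- sample_skewness(c(1, 2, 3, 4, 5))
print(skew_symmetric)
```

Assuming the data frame is called `data` as in the code above, the same helper could be applied to every feature with `sapply(data, sample_skewness)`.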
Density plots were created to explore the distribution of each feature, segmented by diabetes outcome. These plots help identify potential patterns or differences in feature distributions between individuals with and without diabetes. Furthermore, scatter plots were used to visualise the relationship between pairs of features, with data points color-coded based on the outcome variable (diabetes presence). This helped in identifying any patterns or trends that existed between these features.
Density Plot
library(dplyr)
# Helper function to create density plots with outcome comparison
create_density_with_outcome <- function(data, column, title) {
  # Mean of the selected column within each outcome group
  mean_values <- data %>%
    group_by(Outcome) %>%
    summarize(mean_value = mean(.data[[column]], na.rm = TRUE)) %>%
    ungroup()
  # .data[[column]] replaces the deprecated aes_string()
  ggplot(data, aes(x = .data[[column]], fill = as.factor(Outcome))) +
    geom_density(alpha = 0.5) +
    # Dotted vertical line at each group mean; linewidth replaces the deprecated size
    geom_vline(data = mean_values, aes(xintercept = mean_value, color = as.factor(Outcome)),
               linetype = "dotted", linewidth = 1) +
    scale_fill_manual(values = c("#FFFF00", "#008080")) + # Yellow and Teal hex codes
    scale_color_manual(values = c("red", "blue")) +
    labs(title = title, x = column, fill = "Outcome", color = "Outcome") +
    theme_minimal() +
    theme(plot.title = element_text(size = 10, face = "bold"),
          axis.title = element_text(size = 10),
          axis.text = element_text(size = 8))
}
# Create density plots for each feature with outcome comparison
p1 <- create_density_with_outcome(data, "Pregnancies", "Pregnancies vs Diabetes")
p2 <- create_density_with_outcome(data, "Glucose", "Glucose vs Diabetes")
p3 <- create_density_with_outcome(data, "BloodPressure", "Blood Pressure vs Diabetes")
p4 <- create_density_with_outcome(data, "SkinThickness", "Skin Thickness vs Diabetes")
p5 <- create_density_with_outcome(data, "Insulin", "Insulin vs Diabetes")
p6 <- create_density_with_outcome(data, "BMI", "BMI vs Diabetes")
p7 <- create_density_with_outcome(data, "DiabetesPedigreeFunction", "Diabetes Pedigree Function vs Diabetes")
p8 <- create_density_with_outcome(data, "Age", "Age vs Diabetes")
# Arrange plots in a two-column grid layout
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2)
### Density Plot Analysis
The density plots show that individuals with diabetes tend to have a slightly higher mean number of pregnancies compared to those without diabetes. This suggests a possible correlation between higher pregnancy numbers and diabetes risk.
The distribution of glucose levels shows that individuals with diabetes have significantly higher mean glucose levels than those without diabetes. This strong differentiation highlights glucose as a critical factor in diabetes diagnosis and management.
Blood pressure distributions reveal a subtle difference in mean values between individuals with and without diabetes. While there is a slight variation, it suggests that blood pressure alone may not be a strong differentiator for diabetes in this population.
The skin thickness density plots indicate little difference between the two groups, suggesting that this metric does not strongly distinguish between diabetes and non-diabetes individuals.
The insulin level distributions show that individuals with diabetes tend to have slightly higher mean insulin levels. This finding is consistent with a role for hyperinsulinaemia in the development of diabetes, although the density plots alone cannot establish it.
BMI is higher on average for individuals with diabetes. This correlation aligns with known associations between higher body mass index and increased diabetes risk.
The diabetes pedigree function values are slightly higher for individuals with diabetes, indicating a possible genetic predisposition in these cases.
The mean age is higher in the diabetes group. This suggests that age is associated with the prevalence of diabetes, with older individuals being more at risk.
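The group means marked by the dotted lines in the density plots can also be tabulated directly. The sketch below is a minimal base R example under stated assumptions: the data frame `toy` contains made-up values chosen only to show the computation, and `group_means` is a hypothetical name, not an object from the original analysis.

```r
# Toy data with made-up values, used only to illustrate the computation
toy <- data.frame(
  Glucose = c(90, 100, 110, 140, 150, 160),
  BMI     = c(25, 27, 29, 31, 33, 35),
  Outcome = c(0, 0, 0, 1, 1, 1)
)

# Mean of each feature within each outcome group
group_means <- aggregate(cbind(Glucose, BMI) ~ Outcome, data = toy, FUN = mean)
print(group_means)

# Difference in mean glucose between the two groups (diabetes minus no diabetes)
glucose_gap <- group_means$Glucose[group_means$Outcome == 1] -
  group_means$Glucose[group_means$Outcome == 0]
print(glucose_gap)
```

On the full dataset, assuming it is stored in `data`, `aggregate(. ~ Outcome, data = data, FUN = mean)` would give the corresponding table for every feature, which makes it easier to judge how "slightly" or "significantly" the means differ.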
Scatter Plot
# Function to create scatter plot with outcome comparison
create_scatter_with_outcome <- function(data, x_col, y_col, x_title, y_title) {
  # .data[[...]] replaces the deprecated aes_string()
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]], color = as.factor(Outcome))) +
    geom_point(alpha = 0.7) +
    labs(x = x_title, y = y_title, color = "Outcome") +
    theme_minimal() +
    theme(plot.title = element_text(size = 14, face = "bold"),
          axis.title = element_text(size = 12),
          axis.text = element_text(size = 10))
}
# Create scatter plots for two pairs of variables
scatter_plot1 <- create_scatter_with_outcome(data, "Glucose", "BMI", "Glucose", "BMI")
scatter_plot2 <- create_scatter_with_outcome(data, "Age", "BloodPressure", "Age", "Blood Pressure")
# Arrange plots in a grid layout
grid.arrange(scatter_plot1, scatter_plot2, ncol = 2)
### Scatter Plot Analysis
The scatter plot of Glucose levels against BMI (Body Mass Index) shows how these two variables relate. Glucose levels range from 0 to about 200, while BMI values span from 0 to about 67.
In examining the scatter plot, a noticeable positive correlation between Glucose levels and BMI is evident. Higher Glucose levels generally correspond with higher BMI values. This trend is particularly observable among individuals with diabetes, who typically exhibit both elevated Glucose and BMI levels compared to those without diabetes. However, the distinction between the two groups in this plot is not very pronounced, suggesting that additional factors might also play significant roles in differentiating between the outcomes of diabetes and non-diabetes.
The scatter plot analysing Age versus Blood Pressure delves into the interaction between these two variables. Age ranges from 20 to 80 years, and Blood Pressure values extend from 0 to 125.
Observations from this scatter plot indicate no clear linear relationship between Age and Blood Pressure. The data points are widely scattered, highlighting a substantial variability in Blood Pressure across different ages. This variability suggests that Blood Pressure is influenced by multiple factors beyond Age alone. Despite the lack of a strong linear relationship, a slight trend of increasing Blood Pressure with Age is discernible. This trend aligns with the general medical understanding that Blood Pressure tends to rise as individuals age.
These scatter plots visually depict the complex relationships between these health metrics and their association with diabetes.
Pairwise scatter plots were utilized to examine relationships between pairs of numeric variables. Each plot included data points colored by diabetes outcome, facilitating visual identification of correlations or trends between variables.
Pairwise Scatter Plot
# Convert Outcome to factor with appropriate levels
data$Outcome <- factor(data$Outcome, levels = c(0, 1))
# Select only numeric columns (excluding "Outcome")
numeric_data <- data[, sapply(data, is.numeric) & !(names(data) %in% "Outcome")]
# Define a custom color palette for Outcome
my_colors <- c("#1f77b4", "#ff7f0e") # Blue and Orange
# Create a custom wrap function for points to include color
wrap_points <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_point(alpha = 0.5, ...) +
scale_color_manual(values = my_colors)
}
# Plot pairwise scatter plots using GGally with custom aesthetics
ggpairs(data,
columns = which(sapply(data, is.numeric) & !(names(data) %in% "Outcome")),
mapping = ggplot2::aes(color = Outcome),
lower = list(continuous = wrap_points),
upper = list(continuous = wrap("cor", size = 3)),
diag = list(continuous = wrap("barDiag", binwidth = 1)),
title = "Pairwise Scatter Plots of Numeric Variables"
)
### Pairwise Scatter Plot Analysis
This analysis explores the relationships between different numerical variables related to diabetes through scatter plots. Each plot highlights the interaction between two variables, providing insights into potential correlations and patterns.
The scatter plots show that the number of pregnancies has a weak positive correlation with Glucose levels. However, no strong patterns or significant correlations emerge with other variables, indicating that the number of pregnancies may not be a strong predictor for other health metrics in this dataset.
Glucose levels exhibit a positive correlation with BMI (Body Mass Index) and Insulin, suggesting that individuals with higher glucose levels tend to have higher BMI and Insulin values. This correlation aligns with the understanding that elevated glucose levels, hyperinsulinaemia and increased body weight are often linked. However, no clear patterns are observed between glucose levels and other variables.
Blood pressure does not show strong correlations with other variables in the dataset. The scatter plots reveal a wide dispersion of blood pressure values across different levels of other variables, indicating a lack of significant linear relationships.
Skin thickness does not demonstrate significant correlations with other variables. The scatter plots suggest that skin thickness is relatively independent of other health metrics in this dataset, showing no strong linear patterns.
Insulin levels show a strong positive correlation with glucose levels and a weak positive correlation with BMI, suggesting that individuals with higher insulin levels may also have higher BMI values. However, no strong patterns are observed between insulin levels and the remaining variables, indicating that insulin is not a strong predictor of those health metrics in this dataset.
BMI shows a positive correlation with glucose levels, reinforcing the link between increased body weight and higher glucose levels. However, BMI does not exhibit clear patterns with other variables, suggesting that while BMI and glucose levels are related, BMI alone is not strongly predictive of other health metrics.
The diabetes pedigree function does not show strong correlations with other variables. The scatter plots indicate that this metric, which reflects genetic predisposition to diabetes, operates independently of the other health metrics in this dataset.
Age exhibits a weak positive correlation with blood pressure, aligning with the understanding that blood pressure tends to increase with age. However, no strong patterns are observed between age and other variables, indicating that age alone is not a strong predictor of other health metrics in this dataset.
These scatter plots provide a visual overview of the relationships between various health metrics related to diabetes. While some correlations and patterns are observable, many variables do not show strong linear relationships, highlighting the complexity of predicting diabetes and related health outcomes based on these metrics alone.
Box plots were generated to illustrate the distribution of numerical variables across different diabetes outcomes. Each plot depicts the spread and central tendency of variables within each outcome category, offering insights into potential differences between groups.
# Define custom colors
my_colors <- c("#FFFF66", "#66CCCC", "#FF9966") # Yellow, Teal, Peach
# Reshape data for ggplot
data_long <- data %>%
gather(key = "variable", value = "value", -Outcome)
# Create box plots using ggplot
ggplot(data_long, aes(x = Outcome, y = value, fill = as.factor(Outcome))) +
geom_boxplot() +
facet_wrap(~ variable, scales = "free") +
labs(title = "Box Plots of Numerical Variables by Outcome") +
theme_minimal() +
scale_fill_manual(values = my_colors) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels if needed
### Box Plot Analysis
The box plots provide a comparative visualisation of numerical variables based on the diabetes outcome categories (labeled as ‘0’ for no diabetes and ‘1’ for diabetes). These plots help in understanding the distribution and spread of each variable within the two outcome groups.
The median age is higher for individuals in the diabetic category. In addition, the age distribution for individuals with diabetes (outcome ‘1’) is wider, indicating greater variability in age among diabetic individuals compared to those without diabetes (outcome ‘0’).
The median blood pressure is slightly higher for individuals with diabetes (outcome ‘1’). The variability is similar in blood pressure values for both groups.
The median BMI is higher for the diabetes group (outcome ‘1’), indicating that individuals with diabetes tend to have a higher body mass index. Additionally, the range of BMI values is wider for this group, reflecting greater variability in body weight among diabetic individuals.
The median value for the diabetes pedigree function is slightly higher for individuals with diabetes (outcome ‘1’). This function, which represents genetic predisposition, shows more variability among those with diabetes, suggesting diverse genetic factors at play.
The median glucose level is significantly higher for the diabetes group (outcome ‘1’). There is less overlap between the two outcomes for glucose, indicating that higher glucose levels are strongly associated with diabetes.
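Individuals with diabetes (outcome ‘1’) have a lower median insulin level, even though higher insulin readings are more common in this group overall. The many zero readings in the Insulin column (374 of 768 rows) may pull the medians down, so this comparison should be interpreted with caution.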
The median number of pregnancies is higher for individuals with diabetes (outcome ‘1’). The range of pregnancy counts is also wider for this group, indicating greater variability in the number of pregnancies among diabetic individuals.
The median skin thickness is similar for both outcome groups. However, the variability in skin thickness is greater for individuals with diabetes (outcome ‘1’), suggesting more diverse skin thickness measurements among this group.
These box plots highlight the differences in distribution and variability of key health metrics between individuals with and without diabetes. They provide valuable insights into how these variables are associated with the presence of diabetes, helping to identify potential risk factors and areas for further investigation.
A correlation matrix was computed and visualised to quantify the strength and direction of relationships between numeric variables. This analysis provides insights into variables that may influence each other and helps prioritise features for further investigation.
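The code for this step is not shown in the report, so the following is only a minimal sketch of how such a matrix could be computed in base R. The data frame `toy` holds made-up values, and the use of Pearson correlation is an assumption, since the report does not state which coefficient was used.

```r
# Toy data with made-up values, used only to illustrate the computation
toy <- data.frame(
  Glucose = c(90, 100, 110, 140, 150, 160),
  BMI     = c(25, 27, 29, 31, 33, 35),
  Age     = c(50, 23, 41, 30, 62, 28)
)

# Pairwise Pearson correlations; rows with missing values are dropped
cor_matrix <- cor(toy, method = "pearson", use = "complete.obs")
print(round(cor_matrix, 2))
```

Each entry lies between -1 and 1, the diagonal is 1, and the matrix is symmetric, so only one triangle needs to be read. A plotting package can then turn this matrix into the kind of correlation plot discussed below.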
### Correlation Plot Analysis
The correlation plot offers a comprehensive view of how various health-related variables interrelate, providing insights into their mutual influences. This analysis is crucial for understanding which factors might influence others and how they collectively contribute to health outcomes.
Age shows a weak positive correlation with glucose levels, suggesting a tendency for glucose levels to increase slightly with age. This observation shows the importance of age as a factor in understanding metabolic health changes over time. Furthermore, age demonstrates a significant correlation with the number of pregnancies, indicating that older individuals tend to have had more pregnancies throughout their lives.
In this plot, blood pressure exhibits weak correlations with BMI and age. This finding suggests that changes in blood pressure are not strongly influenced by variations in BMI or age within this dataset. However, it highlights the need for further investigation into other potential factors that may impact blood pressure variability.
BMI shows a positive correlation with glucose levels, indicating that individuals with higher BMI tend to have higher glucose levels. This association affirms the link between obesity and metabolic health, where higher BMI can contribute to increased glucose levels. There were no strong correlations observed between BMI and other variables in this analysis.
The diabetes pedigree function does not show significant correlations with any other variables in the plot. This result suggests that genetic predisposition to diabetes, as measured by the pedigree function, operates independently of the other health metrics included in this study. This finding emphasises the complex nature of diabetes susceptibility, involving both genetic and environmental factors.
Glucose levels demonstrate a positive correlation with insulin levels. This relationship indicates that as glucose levels rise, insulin levels tend to increase as well, reflecting the body’s response to maintain glucose homeostasis. Weak correlations were also observed between glucose levels and both BMI and age, suggesting minor associations with these variables.
Insulin levels show weak correlations with skin thickness. This finding suggests that insulin levels may be influenced to a small degree by variations in this health metric. Understanding these relationships can provide insights into insulin regulation and its role in metabolic health.
The number of pregnancies exhibits a strong positive correlation with age. This correlation highlights a natural life course relationship, where older individuals tend to have had more pregnancies. This observation is relevant for understanding reproductive health impacts and potential implications for metabolic health.
Skin thickness shows weak correlations with insulin and BMI. This finding suggests limited associations between skin thickness and these health metrics within the dataset. Further exploration may reveal additional insights into the physiological implications of skin thickness in relation to metabolic health.
The correlation plot analysis provides an understanding of how various health-related variables interact and influence each other.
Preprocessing the data is crucial to ensure that it is clean, consistent, and ready for further analysis. This section outlines the steps taken to prepare the dataset for modeling and analysis.
Zero values in certain variables can sometimes indicate missing data or outliers. It is essential to identify and appropriately handle these values to avoid bias in subsequent analyses.
# Check for zero values in numeric_data
zero_counts <- sapply(numeric_data, function(x) sum(x == 0))
# Print the results
print(zero_counts)
## Pregnancies Glucose BloodPressure
## 111 5 35
## SkinThickness Insulin BMI
## 227 374 11
## DiabetesPedigreeFunction Age
## 0 0
The columns Glucose, BloodPressure, SkinThickness, Insulin, and BMI contain zero values (5, 35, 227, 374, and 11 respectively) that are not plausible for these health-related metrics. These zeros are replaced with NA (Not Available) to signify missing data, enabling more accurate imputation. Zeros in Pregnancies are left unchanged, since zero is a valid count.
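The replacement step described above can be sketched in base R as follows. This is an illustration rather than the report's own code: `zero_as_missing` and `toy_na` are hypothetical names, and the rows in `toy` are rows 1, 2, 3 and 5 of the `head()` output shown earlier.

```r
# Columns where a zero is not a plausible measurement (as listed in the text)
zero_as_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Rows 1, 2, 3 and 5 of the head() output shown earlier
toy <- data.frame(
  Pregnancies   = c(6, 1, 8, 0),
  Glucose       = c(148, 85, 183, 137),
  BloodPressure = c(72, 66, 64, 40),
  SkinThickness = c(35, 29, 0, 35),
  Insulin       = c(0, 0, 0, 168),
  BMI           = c(33.6, 26.6, 23.3, 43.1)
)

# Replace zeros with NA in the selected columns only
toy_na <- toy
toy_na[zero_as_missing] <- lapply(toy_na[zero_as_missing], function(x) {
  x[which(x == 0)] <- NA
  x
})

# Number of missing values per column after the replacement
print(colSums(is.na(toy_na)))
```

Pregnancies is deliberately left out of `zero_as_missing`, so the zero in the last row stays a zero.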
The K-nearest neighbors (KNN) imputation method is employed using the VIM package. This technique fills in missing values based on the values of neighboring data points, ensuring that imputed values are realistic and contextually appropriate for health-related metrics.
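To make the idea concrete, here is a simplified base R sketch of nearest-neighbour imputation for a single column. It is not the VIM implementation used in the report, which differs in details such as the distance measure. `knn_impute_column` is a hypothetical helper, the data in `toy` are made up, and the sketch assumes the predictor columns themselves have no missing values.

```r
# Hypothetical helper: impute one column from the k nearest complete rows.
# Assumes the predictor columns contain no missing values.
knn_impute_column <- function(df, target, predictors, k = 3) {
  # Standardise predictors so that no single variable dominates the distance
  scaled <- scale(as.matrix(df[predictors]))
  missing_rows <- which(is.na(df[[target]]))
  donor_rows <- which(!is.na(df[[target]]))
  for (i in missing_rows) {
    # Euclidean distance from row i to every donor row
    diffs <- sweep(scaled[donor_rows, , drop = FALSE], 2, scaled[i, ])
    distances <- sqrt(rowSums(diffs^2))
    nearest <- donor_rows[order(distances)[seq_len(min(k, length(donor_rows)))]]
    # Use the median of the neighbours' observed values as the imputed value
    df[[target]][i] <- median(df[[target]][nearest])
  }
  df
}

# Toy data with made-up values; the last row is missing Insulin
toy <- data.frame(
  Glucose = c(85, 90, 95, 150, 160, 170, 92),
  BMI     = c(24, 25, 26, 35, 36, 37, 25),
  Insulin = c(50, 60, 70, 200, 220, 240, NA)
)

imputed <- knn_impute_column(toy, target = "Insulin",
                             predictors = c("Glucose", "BMI"), k = 3)
print(imputed$Insulin[7])  # 60: the median of the three low-glucose neighbours
```

The last row resembles the first three rows in Glucose and BMI, so its Insulin value is filled from that group rather than from the high-glucose rows.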
After imputation, columns suffixed with _imp are removed from the dataset. This cleanup step ensures that only the original variables and their imputed values remain for further analysis, reducing redundancy and maintaining clarity in the dataset structure.
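The cleanup could look like the base R sketch below. The data frame `imputed` is a small stand-in with hypothetical `_imp` indicator columns. Its Glucose and Insulin values are taken from the first three rows of the imputed output printed in this section, while the logical flags are illustrative.

```r
# Stand-in for an imputed data frame with hypothetical indicator columns
imputed <- data.frame(
  Glucose     = c(148, 85, 183),
  Insulin     = c(175, 55, 325),
  Glucose_imp = c(FALSE, FALSE, FALSE),
  Insulin_imp = c(TRUE, TRUE, TRUE)
)

# Keep every column whose name does not end in "_imp"
cleaned <- imputed[, !grepl("_imp$", names(imputed)), drop = FALSE]
print(names(cleaned))
```

Anchoring the pattern with `$` avoids dropping a column that merely contains `_imp` somewhere in the middle of its name.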
The “Outcome” variable, which denotes the presence (1) or absence (0) of diabetes, is converted to a factor. This conversion allows for categorical analysis and ensures that the model interprets this variable correctly during predictive modeling and statistical analyses.
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 175 33.6
## 2 1 85 66 29 55 26.6
## 3 8 183 64 28 325 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 27 112 25.6
## 7 3 78 50 32 88 31.0
## 8 10 115 68 39 122 35.3
## 9 2 197 70 45 543 30.5
## 10 8 125 96 36 150 32.7
## 11 4 110 92 38 105 37.6
## 12 10 168 74 32 171 38.0
## 13 10 139 80 19 135 27.1
## 14 1 189 60 23 846 30.1
## 15 5 166 72 19 175 25.8
## 16 7 100 72 29 130 30.0
## 17 0 118 84 47 230 45.8
## 18 7 107 74 30 115 29.6
## 19 1 103 30 38 83 43.3
## 20 1 115 70 30 96 34.6
## 21 3 126 88 41 235 39.3
## 22 8 99 84 26 105 35.4
## 23 7 196 90 32 280 39.8
## 24 9 119 80 35 130 29.0
## 25 11 143 94 33 146 36.6
## 26 10 125 70 26 115 31.1
## 27 7 147 76 33 304 39.4
## 28 1 97 66 15 140 23.2
## 29 13 145 82 19 110 22.2
## 30 5 117 92 38 75 34.1
## 31 5 109 75 26 105 36.0
## 32 3 158 76 36 245 31.6
## 33 3 88 58 11 54 24.8
## 34 6 92 92 27 51 19.9
## 35 10 122 78 31 100 27.6
## 36 4 103 60 33 192 24.0
## 37 11 138 76 33 122 33.2
## 38 9 102 76 37 175 32.9
## 39 2 90 68 42 129 38.2
## 40 4 111 72 47 207 37.1
## 41 3 180 64 25 70 34.0
## 42 7 133 84 39 235 40.2
## 43 7 106 92 18 135 22.7
## 44 9 171 110 24 240 45.4
## 45 7 159 64 31 193 27.4
## 46 0 180 66 39 465 42.0
## 47 1 146 56 25 128 29.7
## 48 2 71 70 27 50 28.0
## 49 7 103 66 32 130 39.1
## 50 7 105 88 33 89 34.2
## 51 1 103 80 11 82 19.4
## 52 1 101 50 15 36 24.2
## 53 5 88 66 21 23 24.4
## 54 8 176 90 34 300 33.7
## 55 7 150 66 42 342 34.7
## 56 1 73 50 10 54 23.0
## 57 7 187 68 39 304 37.7
## 58 0 100 88 60 110 46.8
## 59 0 146 82 40 272 40.5
## 60 0 105 64 41 142 41.5
## 61 2 84 78 32 89 37.2
## 62 8 133 72 35 125 32.9
## 63 5 44 62 29 54 25.0
## 64 2 141 58 34 128 25.4
## 65 7 114 66 29 156 32.8
## 66 5 99 74 27 100 29.0
## 67 0 109 88 30 110 32.5
## 68 2 109 92 30 190 42.7
## 69 1 95 66 13 38 19.6
## 70 4 146 85 27 100 28.9
## 71 2 100 66 20 90 32.9
## 72 5 139 64 35 140 28.6
## 73 13 126 90 36 150 43.4
## 74 4 129 86 20 270 35.1
## 75 1 79 75 30 50 32.0
## 76 1 107 48 20 100 24.7
## 77 7 62 78 30 71 32.6
## 78 5 95 72 33 75 37.7
## 79 0 131 76 40 230 43.2
## 80 2 112 66 22 94 25.0
## 81 3 113 44 13 86 22.4
## 82 2 74 78 32 89 32.0
## 83 7 83 78 26 71 29.3
## 84 0 101 65 28 94 24.6
## 85 5 137 108 36 220 48.8
## 86 2 110 74 29 125 32.4
## 87 13 106 72 54 105 36.6
## 88 2 100 68 25 71 38.5
## 89 15 136 70 32 110 37.1
## 90 1 107 68 19 110 26.5
## 91 1 80 55 15 76 19.1
## 92 4 123 80 15 176 32.0
## 93 7 81 78 40 48 46.7
## 94 4 134 72 27 175 23.8
## 95 2 142 82 18 64 24.7
## 96 6 144 72 27 228 33.9
## 97 2 92 62 28 87 31.6
## 98 1 71 48 18 76 20.4
## 99 6 93 50 30 64 28.7
## 100 1 122 90 51 220 49.7
## 101 1 163 72 38 185 39.0
## 102 1 151 60 25 168 26.1
## 103 0 125 96 22 110 22.5
## 104 1 81 72 18 40 26.6
## 105 2 85 65 32 77 39.6
## 106 1 126 56 29 152 28.7
## 107 1 96 122 19 55 22.4
## 108 4 144 58 28 140 29.5
## 109 3 83 58 31 18 34.3
## 110 0 95 85 25 36 37.4
## 111 3 171 72 33 135 33.3
## 112 8 155 62 26 495 34.0
## 113 1 89 76 34 37 31.2
## 114 4 76 62 32 71 34.0
## 115 7 160 54 32 175 30.5
## 116 4 146 92 31 285 31.2
## 117 5 124 74 30 115 34.0
## 118 5 78 48 32 71 33.7
## 119 4 97 60 23 49 28.2
## 120 4 99 76 15 51 23.2
## 121 0 162 76 56 100 53.2
## 122 6 111 64 39 94 34.2
## 123 2 107 74 30 100 33.6
## 124 5 132 80 26 135 26.8
## 125 0 113 76 35 96 33.3
## 126 1 88 30 42 99 55.0
## 127 3 120 70 30 135 42.9
## 128 1 118 58 36 94 33.3
## 129 1 117 88 24 145 34.5
## 130 0 105 84 29 180 27.9
## 131 4 173 70 14 168 29.7
## 132 9 122 56 31 171 33.3
## 133 3 170 64 37 225 34.5
## 134 8 84 74 31 71 38.3
## 135 2 96 68 13 49 21.1
## 136 2 125 60 20 140 33.8
## 137 0 100 70 26 50 30.8
## 138 0 93 60 25 92 28.7
## 139 0 129 80 26 205 31.2
## 140 5 105 72 29 325 36.9
## 141 3 128 78 26 112 21.1
## 142 5 106 82 30 75 39.5
## 143 2 108 52 26 63 32.5
## 144 10 108 66 36 130 32.4
## 145 4 154 62 31 284 32.8
## 146 0 102 75 23 89 28.7
## 147 9 57 80 37 49 32.8
## 148 2 106 64 35 119 30.5
## 149 5 147 78 27 168 33.7
## 150 2 90 70 17 53 27.3
## 151 1 136 74 50 204 37.4
## 152 4 114 65 24 74 21.9
## 153 9 156 86 28 155 34.3
## 154 1 153 82 42 485 40.6
## 155 8 188 78 32 280 47.9
## 156 7 152 88 44 210 50.0
## 157 2 99 52 15 94 24.6
## 158 1 109 56 21 135 25.2
## 159 2 88 74 19 53 29.0
## 160 17 163 72 41 114 40.9
## 161 4 151 90 38 140 29.7
## 162 7 102 74 40 105 37.2
## 163 0 114 80 34 285 44.2
## 164 2 100 64 23 87 29.7
## 165 0 131 88 30 145 31.6
## 166 6 104 74 18 156 29.9
## 167 3 148 66 25 284 32.5
## 168 4 120 68 32 152 29.6
## 169 4 110 66 32 88 31.9
## 170 3 111 90 12 78 28.4
## 171 6 102 82 32 160 30.8
## 172 6 134 70 23 130 35.4
## 173 2 87 64 23 50 28.9
## 174 1 79 60 42 48 43.5
## 175 2 75 64 24 55 29.7
## 176 8 179 72 42 130 32.7
## 177 6 85 78 30 71 31.2
## 178 0 129 110 46 130 67.1
## 179 5 143 78 39 108 45.0
## 180 5 130 82 32 110 39.1
## 181 6 87 80 27 100 23.2
## 182 0 119 64 18 92 34.9
## 183 1 89 74 20 23 27.7
## 184 5 73 60 23 49 26.8
## 185 4 141 74 36 126 27.6
## 186 7 194 68 28 280 35.9
## 187 8 181 68 36 495 30.1
## 188 1 128 98 41 58 32.0
## 189 8 109 76 39 114 27.9
## 190 5 139 80 35 160 31.6
## 191 3 111 62 22 86 22.6
## 192 9 123 70 44 94 33.1
## 193 7 159 66 33 325 30.4
## 194 11 135 90 37 150 52.3
## 195 8 85 55 20 54 24.4
## 196 5 158 84 41 210 39.4
## 197 1 105 58 20 100 24.3
## 198 3 107 62 13 48 22.9
## 199 4 109 64 44 99 34.8
## 200 4 148 60 27 318 30.9
## 201 0 113 80 16 82 31.0
## 202 1 138 82 37 160 40.1
## 203 0 108 68 20 73 27.3
## 204 2 99 70 16 44 20.4
## 205 6 103 72 32 190 37.7
## 206 5 111 72 28 110 23.9
## 207 8 196 76 29 280 37.5
## 208 5 162 104 32 231 37.7
## 209 1 96 64 27 87 33.2
## 210 7 184 84 33 277 35.5
## 211 2 81 60 22 49 27.7
## 212 0 147 85 54 255 42.8
## 213 7 179 95 31 168 34.2
## 214 0 140 65 26 130 42.6
## 215 9 112 82 32 175 34.2
## 216 12 151 70 40 271 41.8
## 217 5 109 62 41 129 35.8
## 218 6 125 68 30 120 30.0
## 219 5 85 74 22 122 29.0
## 220 5 112 66 32 129 37.8
## 221 0 177 60 29 478 34.6
## 222 2 158 90 30 165 31.6
## 223 7 119 68 28 112 25.2
## 224 7 142 60 33 190 28.8
## 225 1 100 66 15 56 23.6
## 226 1 87 78 27 32 34.6
## 227 0 101 76 32 92 35.7
## 228 3 162 52 38 194 37.2
## 229 4 197 70 39 744 36.7
## 230 0 117 80 31 53 45.2
## 231 4 142 86 38 160 44.0
## 232 6 134 80 37 370 46.2
## 233 1 79 80 25 37 25.4
## 234 4 122 68 33 130 35.0
## 235 3 74 68 28 45 29.7
## 236 4 171 72 32 225 43.6
## 237 7 181 84 21 192 35.9
## 238 0 179 90 27 185 44.1
## 239 9 164 84 21 165 30.8
## 240 0 104 76 20 70 18.4
## 241 1 91 64 24 87 29.2
## 242 4 91 70 32 88 33.1
## 243 3 139 54 35 160 25.6
## 244 6 119 50 22 176 27.1
## 245 2 146 76 35 194 38.2
## 246 9 184 85 15 156 30.0
## 247 10 122 68 33 122 31.2
## 248 0 165 90 33 680 52.3
## 249 9 124 70 33 402 35.4
## 250 1 111 86 19 116 30.1
## 251 9 106 52 25 83 31.2
## 252 2 129 84 22 110 28.0
## 253 2 90 80 14 55 24.4
## 254 0 86 68 32 100 35.8
## 255 12 92 62 7 258 27.6
## 256 1 113 64 35 96 33.6
## 257 3 111 56 39 94 30.1
## 258 2 114 68 22 105 28.7
## 259 1 193 50 16 375 25.9
## 260 11 155 76 28 150 33.3
## 261 3 191 68 15 130 30.9
## 262 3 141 78 30 190 30.0
## 263 4 95 70 32 88 32.1
## 264 3 142 80 15 155 32.4
## 265 4 123 62 29 165 32.0
## 266 5 96 74 18 67 33.6
## 267 0 138 70 38 167 36.3
## 268 2 128 64 42 158 40.0
## 269 0 102 52 20 100 25.1
## 270 2 146 70 30 135 27.5
## 271 10 101 86 37 155 45.6
## 272 2 108 62 32 56 25.2
## 273 3 122 78 36 112 23.0
## 274 1 71 78 50 45 33.2
## 275 13 106 70 33 180 34.2
## 276 2 100 70 52 57 40.5
## 277 7 106 60 24 129 26.5
## 278 0 104 64 23 116 27.8
## 279 5 114 74 28 135 24.9
## 280 2 108 62 10 278 25.3
## 281 0 146 70 40 167 37.9
## 282 10 129 76 28 122 35.9
## 283 7 133 88 15 155 32.4
## 284 7 161 86 32 165 30.4
## 285 2 108 80 27 140 27.0
## 286 7 136 74 26 135 26.0
## 287 5 155 84 44 545 38.7
## 288 1 119 86 39 220 45.6
## 289 4 96 56 17 49 20.8
## 290 5 108 72 43 75 36.1
## 291 0 78 88 29 40 36.9
## 292 0 107 62 30 74 36.6
## 293 2 128 78 37 182 43.3
## 294 1 128 48 45 194 40.5
## 295 0 161 50 20 168 21.9
## 296 6 151 62 31 120 35.5
## 297 2 146 70 38 360 28.0
## 298 0 126 84 29 215 30.7
## 299 14 100 78 25 184 36.6
## 300 8 112 72 19 135 23.6
## 301 0 167 74 30 245 32.3
## 302 2 144 58 33 135 31.6
## 303 5 77 82 41 42 35.8
## 304 5 115 98 36 220 52.9
## 305 3 150 76 30 140 21.0
## 306 2 120 76 37 105 39.7
## 307 10 161 68 23 132 25.5
## 308 0 137 68 14 148 24.8
## 309 0 128 68 19 180 30.5
## 310 2 124 68 28 205 32.9
## 311 6 80 66 30 71 26.2
## 312 0 106 70 37 148 39.4
## 313 2 155 74 17 96 26.6
## 314 3 113 50 10 85 29.5
## 315 7 109 80 31 156 35.9
## 316 2 112 68 22 94 34.1
## 317 3 99 80 11 64 19.3
## 318 3 182 74 31 135 30.5
## 319 3 115 66 39 140 38.1
## 320 6 194 78 32 300 23.5
## 321 4 129 60 12 231 27.5
## 322 3 112 74 30 135 31.6
## 323 0 124 70 20 115 27.4
## 324 13 152 90 33 29 26.8
## 325 2 112 75 32 135 35.7
## 326 1 157 72 21 168 25.6
## 327 1 122 64 32 156 35.1
## 328 10 179 70 33 122 35.1
## 329 2 102 86 36 120 45.5
## 330 6 105 70 32 68 30.8
## 331 8 118 72 19 87 23.1
## 332 2 87 58 16 52 32.7
## 333 1 180 74 31 180 43.3
## 334 12 106 80 31 100 23.6
## 335 1 95 60 18 58 23.9
## 336 0 165 76 43 255 47.9
## 337 0 117 90 26 196 33.8
## 338 5 115 76 31 156 31.2
## 339 9 152 78 34 171 34.2
## 340 7 178 84 32 225 39.9
## 341 1 130 70 13 105 25.9
## 342 1 95 74 21 73 25.9
## 343 1 93 68 35 77 32.0
## 344 5 122 86 37 105 34.7
## 345 8 95 72 26 105 36.8
## 346 8 126 88 36 108 38.5
## 347 1 139 46 19 83 28.7
## 348 3 116 64 22 105 23.5
## 349 3 99 62 19 74 21.8
## 350 5 116 80 32 175 41.0
## 351 4 92 80 36 105 42.2
## 352 4 137 84 37 130 31.2
## 353 3 61 82 28 76 34.4
## 354 1 90 62 12 43 27.2
## 355 3 90 78 32 88 42.7
## 356 9 165 88 31 165 30.4
## 357 1 125 50 40 167 33.3
## 358 13 129 76 30 150 39.9
## 359 12 88 74 40 54 35.3
## 360 1 196 76 36 249 36.5
## 361 5 189 64 33 325 31.2
## 362 5 158 70 27 168 29.8
## 363 5 103 108 37 108 39.2
## 364 4 146 78 35 300 38.5
## 365 4 147 74 25 293 34.9
## 366 5 99 54 28 83 34.0
## 367 6 124 72 29 130 27.6
## 368 0 101 64 17 82 21.0
## 369 3 81 86 16 66 27.5
## 370 1 133 102 28 140 32.8
## 371 3 173 82 48 465 38.4
## 372 0 118 64 23 89 27.7
## 373 0 84 64 22 66 35.8
## 374 2 105 58 40 94 34.9
## 375 2 122 52 43 158 36.2
## 376 12 140 82 43 325 39.2
## 377 0 98 82 15 84 25.2
## 378 1 87 60 37 75 37.2
## 379 4 156 75 36 277 48.3
## 380 0 93 100 39 72 43.4
## 381 1 107 72 30 82 30.8
## 382 0 105 68 22 58 20.0
## 383 1 109 60 8 182 25.4
## 384 1 90 62 18 59 25.1
## 385 1 125 70 24 110 24.3
## 386 1 119 54 13 50 22.3
## 387 5 116 74 29 156 32.3
## 388 8 105 100 36 215 43.3
## 389 5 144 82 26 285 32.0
## 390 3 100 68 23 81 31.6
## 391 1 100 66 29 196 32.0
## 392 5 166 76 36 210 45.7
## 393 1 131 64 14 415 23.7
## 394 4 116 72 12 87 22.1
## 395 4 158 78 32 205 32.9
## 396 2 127 58 24 275 27.7
## 397 3 96 56 34 115 24.7
## 398 0 131 66 40 165 34.3
## 399 3 82 70 22 44 21.1
## 400 3 193 70 31 225 34.9
## 401 4 95 64 30 115 32.0
## 402 6 137 61 27 190 24.2
## 403 5 136 84 41 88 35.0
## 404 9 72 78 25 68 31.6
## 405 5 168 64 35 225 32.9
## 406 2 123 48 32 165 42.1
## 407 4 115 72 29 122 28.9
## 408 0 101 62 22 58 21.9
## 409 8 197 74 28 225 25.9
## 410 1 172 68 49 579 42.4
## 411 6 102 90 39 77 35.7
## 412 1 112 72 30 176 34.4
## 413 1 143 84 23 310 42.4
## 414 1 143 74 22 61 26.2
## 415 0 138 60 35 167 34.6
## 416 3 173 84 33 474 35.7
## 417 1 97 68 21 81 27.2
## 418 4 144 82 32 210 38.5
## 419 1 83 68 23 49 18.2
## 420 3 129 64 29 115 26.4
## 421 1 119 88 41 170 45.3
## 422 2 94 68 18 76 26.0
## 423 0 102 64 46 78 40.6
## 424 2 115 64 22 106 30.8
## 425 8 151 78 32 210 42.9
## 426 4 184 78 39 277 37.0
## 427 0 94 76 31 115 35.8
## 428 1 181 64 30 180 34.1
## 429 0 135 94 46 145 40.6
## 430 1 95 82 25 180 35.0
## 431 2 99 60 17 74 22.2
## 432 3 89 74 16 85 30.4
## 433 1 80 74 11 60 30.0
## 434 2 139 75 22 110 25.6
## 435 1 90 68 8 70 24.5
## 436 0 141 75 40 230 42.4
## 437 12 140 85 33 108 37.4
## 438 5 147 75 27 126 29.9
## 439 1 97 70 15 46 18.2
## 440 6 107 88 33 230 36.8
## 441 0 189 104 25 145 34.3
## 442 2 83 66 23 50 32.2
## 443 4 117 64 27 120 33.2
## 444 8 108 70 31 156 30.5
## 445 4 117 62 12 115 29.7
## 446 0 180 78 63 14 59.4
## 447 1 100 72 12 70 25.3
## 448 0 95 80 45 92 36.5
## 449 0 104 64 37 64 33.6
## 450 0 120 74 18 63 30.5
## 451 1 82 64 13 95 21.2
## 452 2 134 70 30 190 28.9
## 453 0 91 68 32 210 39.9
## 454 2 119 74 26 73 19.6
## 455 2 100 54 28 105 37.8
## 456 14 175 62 30 132 33.6
## 457 1 135 54 26 152 26.7
## 458 5 86 68 28 71 30.2
## 459 10 148 84 48 237 37.6
## 460 9 134 74 33 60 25.9
## 461 9 120 72 22 56 20.8
## 462 1 71 62 18 41 21.8
## 463 8 74 70 40 49 35.3
## 464 5 88 78 30 68 27.6
## 465 10 115 98 28 110 24.0
## 466 0 124 56 13 105 21.8
## 467 0 74 52 10 36 27.8
## 468 0 97 64 36 100 36.8
## 469 8 120 74 32 130 30.0
## 470 6 154 78 41 140 46.1
## 471 1 144 82 40 194 41.3
## 472 0 137 70 38 135 33.2
## 473 0 119 66 27 142 38.8
## 474 7 136 90 31 135 29.9
## 475 4 114 64 22 120 28.9
## 476 0 137 84 27 120 27.3
## 477 2 105 80 45 191 33.7
## 478 7 114 76 17 110 23.8
## 479 8 126 74 38 75 25.9
## 480 4 132 86 31 135 28.0
## 481 3 158 70 30 328 35.5
## 482 0 123 88 37 105 35.2
## 483 4 85 58 22 49 27.8
## 484 0 84 82 31 125 38.2
## 485 0 145 80 36 220 44.2
## 486 0 135 68 42 250 42.3
## 487 1 139 62 41 480 40.7
## 488 0 173 78 32 265 46.5
## 489 4 99 72 17 51 25.6
## 490 8 194 80 31 135 26.1
## 491 2 83 65 28 66 36.8
## 492 2 89 90 30 100 33.5
## 493 4 99 68 38 88 32.8
## 494 4 125 70 18 122 28.9
## 495 3 80 78 32 56 32.0
## 496 6 166 74 27 168 26.6
## 497 5 110 68 27 100 26.0
## 498 2 81 72 15 76 30.1
## 499 7 195 70 33 145 25.1
## 500 6 154 74 32 193 29.3
## 501 2 117 90 19 71 25.2
## 502 3 84 72 32 77 37.2
## 503 6 144 68 41 215 39.0
## 504 7 94 64 25 79 33.3
## 505 3 96 78 39 105 37.3
## 506 10 75 82 30 49 33.3
## 507 0 180 90 26 90 36.5
## 508 1 130 60 23 170 28.6
## 509 2 84 50 23 76 30.4
## 510 8 120 78 26 60 25.0
## 511 12 84 72 31 175 29.7
## 512 0 139 62 17 210 22.1
## 513 9 91 68 18 126 24.2
## 514 2 91 62 23 50 27.3
## 515 3 99 54 19 86 25.6
## 516 3 163 70 18 105 31.6
## 517 9 145 88 34 165 30.3
## 518 7 125 86 30 108 37.6
## 519 13 76 60 37 105 32.8
## 520 6 129 90 7 326 19.6
## 521 2 68 70 32 66 25.0
## 522 3 124 80 33 130 33.2
## 523 6 114 92 36 170 34.7
## 524 9 130 70 35 144 34.2
## 525 3 125 58 24 158 31.6
## 526 3 87 60 18 58 21.8
## 527 1 97 64 19 82 18.2
## 528 3 116 74 15 105 26.3
## 529 0 117 66 31 188 30.8
## 530 0 111 65 22 73 24.6
## 531 2 122 60 18 106 29.8
## 532 0 107 76 32 148 45.3
## 533 1 86 66 52 65 41.3
## 534 6 91 78 28 71 29.8
## 535 1 77 56 30 56 33.3
## 536 4 132 62 35 135 32.9
## 537 0 105 90 27 105 29.6
## 538 0 57 60 20 56 21.7
## 539 0 127 80 37 210 36.3
## 540 3 129 92 49 155 36.4
## 541 8 100 74 40 215 39.4
## 542 3 128 72 25 190 32.4
## 543 10 90 85 32 165 34.9
## 544 4 84 90 23 56 39.5
## 545 1 88 78 29 76 32.0
## 546 8 186 90 35 225 34.5
## 547 5 187 76 27 207 43.6
## 548 4 131 68 21 166 33.1
## 549 1 164 82 43 67 32.8
## 550 4 189 110 31 130 28.5
## 551 1 116 70 28 110 27.4
## 552 3 84 68 30 106 31.9
## 553 6 114 88 18 155 27.8
## 554 1 88 62 24 44 29.9
## 555 1 84 64 23 115 36.9
## 556 7 124 70 33 215 25.5
## 557 1 97 70 40 90 38.1
## 558 8 110 76 18 135 27.8
## 559 11 103 68 40 94 46.2
## 560 11 85 74 27 105 30.1
## 561 6 125 76 32 370 33.8
## 562 0 198 66 32 274 41.3
## 563 1 87 68 34 77 37.6
## 564 6 99 60 19 54 26.9
## 565 0 91 80 31 100 32.4
## 566 2 95 54 14 88 26.1
## 567 1 99 72 30 18 38.6
## 568 6 92 62 32 126 32.0
## 569 4 154 72 29 126 31.3
## 570 0 121 66 30 165 34.3
## 571 3 78 70 32 55 32.5
## 572 2 130 96 22 110 22.6
## 573 3 111 58 31 44 29.5
## 574 2 98 60 17 120 34.7
## 575 1 143 86 30 330 30.1
## 576 1 119 44 47 63 35.5
## 577 6 108 44 20 130 24.0
## 578 2 118 80 35 182 42.9
## 579 10 133 68 31 122 27.0
## 580 2 197 70 99 495 34.7
## 581 0 151 90 46 230 42.1
## 582 6 109 60 27 64 25.0
## 583 12 121 78 17 110 26.5
## 584 8 100 76 39 105 38.7
## 585 8 124 76 24 600 28.7
## 586 1 93 56 11 58 22.5
## 587 8 143 66 36 304 34.9
## 588 6 103 66 27 68 24.3
## 589 3 176 86 27 156 33.3
## 590 0 73 62 17 41 21.1
## 591 11 111 84 40 215 46.8
## 592 2 112 78 50 140 39.4
## 593 3 132 80 32 140 34.4
## 594 2 82 52 22 115 28.5
## 595 6 123 72 45 230 33.6
## 596 0 188 82 14 185 32.0
## 597 0 67 76 32 125 45.3
## 598 1 89 24 19 25 27.8
## 599 1 173 74 31 180 36.8
## 600 1 109 38 18 120 23.1
## 601 1 108 88 19 84 27.1
## 602 6 96 74 27 100 23.7
## 603 1 124 74 36 110 27.8
## 604 7 150 78 29 126 35.2
## 605 4 183 66 28 180 28.4
## 606 1 124 60 32 176 35.8
## 607 1 181 78 42 293 40.0
## 608 1 92 62 25 41 19.5
## 609 0 152 82 39 272 41.5
## 610 1 111 62 13 182 24.0
## 611 3 106 54 21 158 30.9
## 612 3 174 58 22 194 32.9
## 613 7 168 88 42 321 38.2
## 614 6 105 80 28 82 32.5
## 615 11 138 74 26 144 36.1
## 616 3 106 72 22 100 25.8
## 617 6 117 96 38 100 28.7
## 618 2 68 62 13 15 20.1
## 619 9 112 82 24 155 28.2
## 620 0 119 70 30 74 32.4
## 621 2 112 86 42 160 38.4
## 622 2 92 76 20 81 24.2
## 623 6 183 94 31 193 40.8
## 624 0 94 70 27 115 43.5
## 625 2 108 64 23 94 30.8
## 626 4 90 88 47 54 37.7
## 627 0 125 68 22 148 24.7
## 628 0 132 78 26 188 32.4
## 629 5 128 80 39 105 34.6
## 630 4 94 65 22 74 24.7
## 631 7 114 64 29 156 27.4
## 632 0 102 78 40 90 34.5
## 633 2 111 60 23 116 26.2
## 634 1 128 82 17 183 27.5
## 635 10 92 62 27 54 25.9
## 636 13 104 72 31 130 31.2
## 637 5 104 74 30 105 28.8
## 638 2 94 76 18 66 31.6
## 639 7 97 76 32 91 40.9
## 640 1 100 74 12 46 19.5
## 641 0 102 86 17 105 29.3
## 642 4 128 70 32 130 34.3
## 643 6 147 80 31 285 29.5
## 644 4 90 66 23 54 28.0
## 645 3 103 72 30 152 27.6
## 646 2 157 74 35 440 39.4
## 647 1 167 74 17 144 23.4
## 648 0 179 50 36 159 37.8
## 649 11 136 84 35 130 28.3
## 650 0 107 60 25 116 26.4
## 651 1 91 54 25 100 25.2
## 652 1 117 60 23 106 33.8
## 653 5 123 74 40 77 34.1
## 654 2 120 54 22 106 26.8
## 655 1 106 70 28 135 34.2
## 656 2 155 52 27 540 38.7
## 657 2 101 58 35 90 21.8
## 658 1 120 80 48 200 38.9
## 659 11 127 106 33 105 39.0
## 660 3 80 82 31 70 34.2
## 661 10 162 84 31 110 27.7
## 662 1 199 76 43 274 42.9
## 663 8 167 106 46 231 37.6
## 664 9 145 80 46 130 37.9
## 665 6 115 60 39 125 33.7
## 666 1 112 80 45 132 34.8
## 667 4 145 82 18 175 32.5
## 668 10 111 70 27 130 27.5
## 669 6 98 58 33 190 34.0
## 670 9 154 78 30 100 30.9
## 671 6 165 68 26 168 33.6
## 672 1 99 58 10 94 25.4
## 673 10 68 106 23 49 35.5
## 674 3 123 100 35 240 57.3
## 675 8 91 82 26 108 35.6
## 676 6 195 70 28 200 30.9
## 677 9 156 86 32 145 24.8
## 678 0 93 60 32 87 35.3
## 679 3 121 52 35 129 36.0
## 680 2 101 58 17 265 24.2
## 681 2 56 56 28 45 24.2
## 682 0 162 76 36 130 49.6
## 683 0 95 64 39 105 44.6
## 684 4 125 80 30 160 32.3
## 685 5 136 82 26 135 28.0
## 686 2 129 74 26 205 33.2
## 687 3 130 64 22 210 23.1
## 688 1 107 50 19 100 28.3
## 689 1 140 74 26 180 24.1
## 690 1 144 82 46 180 46.1
## 691 8 107 80 28 110 24.6
## 692 13 158 114 32 146 42.3
## 693 2 121 70 32 95 39.1
## 694 7 129 68 49 125 38.5
## 695 2 90 60 22 74 23.5
## 696 7 142 90 24 480 30.4
## 697 3 169 74 19 125 29.9
## 698 0 99 62 22 94 25.0
## 699 4 127 88 11 155 34.5
## 700 4 118 70 32 135 44.5
## 701 2 122 76 27 200 35.9
## 702 6 125 78 31 175 27.6
## 703 1 168 88 29 156 35.0
## 704 2 129 86 37 105 38.5
## 705 4 110 76 20 100 28.4
## 706 6 80 80 36 54 39.8
## 707 10 115 96 36 175 34.2
## 708 2 127 46 21 335 34.4
## 709 9 164 78 32 132 32.8
## 710 2 93 64 32 160 38.0
## 711 3 158 64 13 387 31.2
## 712 5 126 78 27 22 29.6
## 713 10 129 62 36 130 41.2
## 714 0 134 58 20 291 26.4
## 715 3 102 74 27 105 29.5
## 716 7 187 50 33 392 33.9
## 717 3 173 78 39 185 33.8
## 718 10 94 72 18 110 23.1
## 719 1 108 60 46 178 35.5
## 720 5 97 76 27 180 35.6
## 721 4 83 86 19 66 29.3
## 722 1 114 66 36 200 38.1
## 723 1 149 68 29 127 29.3
## 724 5 117 86 30 105 39.1
## 725 1 111 94 30 160 32.8
## 726 4 112 78 40 105 39.4
## 727 1 116 78 29 180 36.1
## 728 0 141 84 26 205 32.4
## 729 2 175 88 25 71 22.9
## 730 2 92 52 23 86 30.1
## 731 3 130 78 23 79 28.4
## 732 8 120 86 27 115 28.4
## 733 2 174 88 37 120 44.5
## 734 2 106 56 27 165 29.0
## 735 2 105 75 20 87 23.3
## 736 4 95 60 32 83 35.4
## 737 0 126 86 27 120 27.4
## 738 8 65 72 23 71 32.0
## 739 2 99 60 17 160 36.6
## 740 1 102 74 32 145 39.5
## 741 11 120 80 37 150 42.3
## 742 3 102 44 20 94 30.8
## 743 1 109 58 18 116 28.5
## 744 9 140 94 35 146 32.7
## 745 13 153 88 37 140 40.6
## 746 12 100 84 33 105 30.0
## 747 1 147 94 41 220 49.3
## 748 1 81 74 41 57 46.3
## 749 3 187 70 22 200 36.4
## 750 6 162 62 35 175 24.3
## 751 4 136 70 29 190 31.2
## 752 1 121 78 39 74 39.0
## 753 3 108 62 24 86 26.0
## 754 0 181 88 44 510 43.3
## 755 8 154 78 32 210 32.4
## 756 1 128 88 39 110 36.5
## 757 7 137 90 41 94 32.0
## 758 0 123 72 35 145 36.3
## 759 1 106 76 32 90 37.5
## 760 6 190 92 33 225 35.5
## 761 2 88 58 26 16 28.4
## 762 9 170 74 31 225 44.0
## 763 9 89 62 27 54 22.5
## 764 10 101 76 48 180 32.9
## 765 2 122 70 27 180 36.8
## 766 5 121 72 23 112 26.2
## 767 1 126 60 31 140 30.1
## 768 1 93 70 31 44 30.4
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
## 7 0.248 26 1
## 8 0.134 29 0
## 9 0.158 53 1
## 10 0.232 54 1
## 11 0.191 30 0
## 12 0.537 34 1
## 13 1.441 57 0
## 14 0.398 59 1
## 15 0.587 51 1
## 16 0.484 32 1
## 17 0.551 31 1
## 18 0.254 31 1
## 19 0.183 33 0
## 20 0.529 32 1
## 21 0.704 27 0
## 22 0.388 50 0
## 23 0.451 41 1
## 24 0.263 29 1
## 25 0.254 51 1
## 26 0.205 41 1
## 27 0.257 43 1
## 28 0.487 22 0
## 29 0.245 57 0
## 30 0.337 38 0
## 31 0.546 60 0
## 32 0.851 28 1
## 33 0.267 22 0
## 34 0.188 28 0
## 35 0.512 45 0
## 36 0.966 33 0
## 37 0.420 35 0
## 38 0.665 46 1
## 39 0.503 27 1
## 40 1.390 56 1
## 41 0.271 26 0
## 42 0.696 37 0
## 43 0.235 48 0
## 44 0.721 54 1
## 45 0.294 40 0
## 46 1.893 25 1
## 47 0.564 29 0
## 48 0.586 22 0
## 49 0.344 31 1
## 50 0.305 24 0
## 51 0.491 22 0
## 52 0.526 26 0
## 53 0.342 30 0
## 54 0.467 58 1
## 55 0.718 42 0
## 56 0.248 21 0
## 57 0.254 41 1
## 58 0.962 31 0
## 59 1.781 44 0
## 60 0.173 22 0
## 61 0.304 21 0
## 62 0.270 39 1
## 63 0.587 36 0
## 64 0.699 24 0
## 65 0.258 42 1
## 66 0.203 32 0
## 67 0.855 38 1
## 68 0.845 54 0
## 69 0.334 25 0
## 70 0.189 27 0
## 71 0.867 28 1
## 72 0.411 26 0
## 73 0.583 42 1
## 74 0.231 23 0
## 75 0.396 22 0
## 76 0.140 22 0
## 77 0.391 41 0
## 78 0.370 27 0
## 79 0.270 26 1
## 80 0.307 24 0
## 81 0.140 22 0
## 82 0.102 22 0
## 83 0.767 36 0
## 84 0.237 22 0
## 85 0.227 37 1
## 86 0.698 27 0
## 87 0.178 45 0
## 88 0.324 26 0
## 89 0.153 43 1
## 90 0.165 24 0
## 91 0.258 21 0
## 92 0.443 34 0
## 93 0.261 42 0
## 94 0.277 60 1
## 95 0.761 21 0
## 96 0.255 40 0
## 97 0.130 24 0
## 98 0.323 22 0
## 99 0.356 23 0
## 100 0.325 31 1
## 101 1.222 33 1
## 102 0.179 22 0
## 103 0.262 21 0
## 104 0.283 24 0
## 105 0.930 27 0
## 106 0.801 21 0
## 107 0.207 27 0
## 108 0.287 37 0
## 109 0.336 25 0
## 110 0.247 24 1
## 111 0.199 24 1
## 112 0.543 46 1
## 113 0.192 23 0
## 114 0.391 25 0
## 115 0.588 39 1
## 116 0.539 61 1
## 117 0.220 38 1
## 118 0.654 25 0
## 119 0.443 22 0
## 120 0.223 21 0
## 121 0.759 25 1
## 122 0.260 24 0
## 123 0.404 23 0
## 124 0.186 69 0
## 125 0.278 23 1
## 126 0.496 26 1
## 127 0.452 30 0
## 128 0.261 23 0
## 129 0.403 40 1
## 130 0.741 62 1
## 131 0.361 33 1
## 132 1.114 33 1
## 133 0.356 30 1
## 134 0.457 39 0
## 135 0.647 26 0
## 136 0.088 31 0
## 137 0.597 21 0
## 138 0.532 22 0
## 139 0.703 29 0
## 140 0.159 28 0
## 141 0.268 55 0
## 142 0.286 38 0
## 143 0.318 22 0
## 144 0.272 42 1
## 145 0.237 23 0
## 146 0.572 21 0
## 147 0.096 41 0
## 148 1.400 34 0
## 149 0.218 65 0
## 150 0.085 22 0
## 151 0.399 24 0
## 152 0.432 37 0
## 153 1.189 42 1
## 154 0.687 23 0
## 155 0.137 43 1
## 156 0.337 36 1
## 157 0.637 21 0
## 158 0.833 23 0
## 159 0.229 22 0
## 160 0.817 47 1
## 161 0.294 36 0
## 162 0.204 45 0
## 163 0.167 27 0
## 164 0.368 21 0
## 165 0.743 32 1
## 166 0.722 41 1
## 167 0.256 22 0
## 168 0.709 34 0
## 169 0.471 29 0
## 170 0.495 29 0
## 171 0.180 36 1
## 172 0.542 29 1
## 173 0.773 25 0
## 174 0.678 23 0
## 175 0.370 33 0
## 176 0.719 36 1
## 177 0.382 42 0
## 178 0.319 26 1
## 179 0.190 47 0
## 180 0.956 37 1
## 181 0.084 32 0
## 182 0.725 23 0
## 183 0.299 21 0
## 184 0.268 27 0
## 185 0.244 40 0
## 186 0.745 41 1
## 187 0.615 60 1
## 188 1.321 33 1
## 189 0.640 31 1
## 190 0.361 25 1
## 191 0.142 21 0
## 192 0.374 40 0
## 193 0.383 36 1
## 194 0.578 40 1
## 195 0.136 42 0
## 196 0.395 29 1
## 197 0.187 21 0
## 198 0.678 23 1
## 199 0.905 26 1
## 200 0.150 29 1
## 201 0.874 21 0
## 202 0.236 28 0
## 203 0.787 32 0
## 204 0.235 27 0
## 205 0.324 55 0
## 206 0.407 27 0
## 207 0.605 57 1
## 208 0.151 52 1
## 209 0.289 21 0
## 210 0.355 41 1
## 211 0.290 25 0
## 212 0.375 24 0
## 213 0.164 60 0
## 214 0.431 24 1
## 215 0.260 36 1
## 216 0.742 38 1
## 217 0.514 25 1
## 218 0.464 32 0
## 219 1.224 32 1
## 220 0.261 41 1
## 221 1.072 21 1
## 222 0.805 66 1
## 223 0.209 37 0
## 224 0.687 61 0
## 225 0.666 26 0
## 226 0.101 22 0
## 227 0.198 26 0
## 228 0.652 24 1
## 229 2.329 31 0
## 230 0.089 24 0
## 231 0.645 22 1
## 232 0.238 46 1
## 233 0.583 22 0
## 234 0.394 29 0
## 235 0.293 23 0
## 236 0.479 26 1
## 237 0.586 51 1
## 238 0.686 23 1
## 239 0.831 32 1
## 240 0.582 27 0
## 241 0.192 21 0
## 242 0.446 22 0
## 243 0.402 22 1
## 244 1.318 33 1
## 245 0.329 29 0
## 246 1.213 49 1
## 247 0.258 41 0
## 248 0.427 23 0
## 249 0.282 34 0
## 250 0.143 23 0
## 251 0.380 42 0
## 252 0.284 27 0
## 253 0.249 24 0
## 254 0.238 25 0
## 255 0.926 44 1
## 256 0.543 21 1
## 257 0.557 30 0
## 258 0.092 25 0
## 259 0.655 24 0
## 260 1.353 51 1
## 261 0.299 34 0
## 262 0.761 27 1
## 263 0.612 24 0
## 264 0.200 63 0
## 265 0.226 35 1
## 266 0.997 43 0
## 267 0.933 25 1
## 268 1.101 24 0
## 269 0.078 21 0
## 270 0.240 28 1
## 271 1.136 38 1
## 272 0.128 21 0
## 273 0.254 40 0
## 274 0.422 21 0
## 275 0.251 52 0
## 276 0.677 25 0
## 277 0.296 29 1
## 278 0.454 23 0
## 279 0.744 57 0
## 280 0.881 22 0
## 281 0.334 28 1
## 282 0.280 39 0
## 283 0.262 37 0
## 284 0.165 47 1
## 285 0.259 52 1
## 286 0.647 51 0
## 287 0.619 34 0
## 288 0.808 29 1
## 289 0.340 26 0
## 290 0.263 33 0
## 291 0.434 21 0
## 292 0.757 25 1
## 293 1.224 31 1
## 294 0.613 24 1
## 295 0.254 65 0
## 296 0.692 28 0
## 297 0.337 29 1
## 298 0.520 24 0
## 299 0.412 46 1
## 300 0.840 58 0
## 301 0.839 30 1
## 302 0.422 25 1
## 303 0.156 35 0
## 304 0.209 28 1
## 305 0.207 37 0
## 306 0.215 29 0
## 307 0.326 47 1
## 308 0.143 21 0
## 309 1.391 25 1
## 310 0.875 30 1
## 311 0.313 41 0
## 312 0.605 22 0
## 313 0.433 27 1
## 314 0.626 25 0
## 315 1.127 43 1
## 316 0.315 26 0
## 317 0.284 30 0
## 318 0.345 29 1
## 319 0.150 28 0
## 320 0.129 59 1
## 321 0.527 31 0
## 322 0.197 25 1
## 323 0.254 36 1
## 324 0.731 43 1
## 325 0.148 21 0
## 326 0.123 24 0
## 327 0.692 30 1
## 328 0.200 37 0
## 329 0.127 23 1
## 330 0.122 37 0
## 331 1.476 46 0
## 332 0.166 25 0
## 333 0.282 41 1
## 334 0.137 44 0
## 335 0.260 22 0
## 336 0.259 26 0
## 337 0.932 44 0
## 338 0.343 44 1
## 339 0.893 33 1
## 340 0.331 41 1
## 341 0.472 22 0
## 342 0.673 36 0
## 343 0.389 22 0
## 344 0.290 33 0
## 345 0.485 57 0
## 346 0.349 49 0
## 347 0.654 22 0
## 348 0.187 23 0
## 349 0.279 26 0
## 350 0.346 37 1
## 351 0.237 29 0
## 352 0.252 30 0
## 353 0.243 46 0
## 354 0.580 24 0
## 355 0.559 21 0
## 356 0.302 49 1
## 357 0.962 28 1
## 358 0.569 44 1
## 359 0.378 48 0
## 360 0.875 29 1
## 361 0.583 29 1
## 362 0.207 63 0
## 363 0.305 65 0
## 364 0.520 67 1
## 365 0.385 30 0
## 366 0.499 30 0
## 367 0.368 29 1
## 368 0.252 21 0
## 369 0.306 22 0
## 370 0.234 45 1
## 371 2.137 25 1
## 372 1.731 21 0
## 373 0.545 21 0
## 374 0.225 25 0
## 375 0.816 28 0
## 376 0.528 58 1
## 377 0.299 22 0
## 378 0.509 22 0
## 379 0.238 32 1
## 380 1.021 35 0
## 381 0.821 24 0
## 382 0.236 22 0
## 383 0.947 21 0
## 384 1.268 25 0
## 385 0.221 25 0
## 386 0.205 24 0
## 387 0.660 35 1
## 388 0.239 45 1
## 389 0.452 58 1
## 390 0.949 28 0
## 391 0.444 42 0
## 392 0.340 27 1
## 393 0.389 21 0
## 394 0.463 37 0
## 395 0.803 31 1
## 396 1.600 25 0
## 397 0.944 39 0
## 398 0.196 22 1
## 399 0.389 25 0
## 400 0.241 25 1
## 401 0.161 31 1
## 402 0.151 55 0
## 403 0.286 35 1
## 404 0.280 38 0
## 405 0.135 41 1
## 406 0.520 26 0
## 407 0.376 46 1
## 408 0.336 25 0
## 409 1.191 39 1
## 410 0.702 28 1
## 411 0.674 28 0
## 412 0.528 25 0
## 413 1.076 22 0
## 414 0.256 21 0
## 415 0.534 21 1
## 416 0.258 22 1
## 417 1.095 22 0
## 418 0.554 37 1
## 419 0.624 27 0
## 420 0.219 28 1
## 421 0.507 26 0
## 422 0.561 21 0
## 423 0.496 21 0
## 424 0.421 21 0
## 425 0.516 36 1
## 426 0.264 31 1
## 427 0.256 25 0
## 428 0.328 38 1
## 429 0.284 26 0
## 430 0.233 43 1
## 431 0.108 23 0
## 432 0.551 38 0
## 433 0.527 22 0
## 434 0.167 29 0
## 435 1.138 36 0
## 436 0.205 29 1
## 437 0.244 41 0
## 438 0.434 28 0
## 439 0.147 21 0
## 440 0.727 31 0
## 441 0.435 41 1
## 442 0.497 22 0
## 443 0.230 24 0
## 444 0.955 33 1
## 445 0.380 30 1
## 446 2.420 25 1
## 447 0.658 28 0
## 448 0.330 26 0
## 449 0.510 22 1
## 450 0.285 26 0
## 451 0.415 23 0
## 452 0.542 23 1
## 453 0.381 25 0
## 454 0.832 72 0
## 455 0.498 24 0
## 456 0.212 38 1
## 457 0.687 62 0
## 458 0.364 24 0
## 459 1.001 51 1
## 460 0.460 81 0
## 461 0.733 48 0
## 462 0.416 26 0
## 463 0.705 39 0
## 464 0.258 37 0
## 465 1.022 34 0
## 466 0.452 21 0
## 467 0.269 22 0
## 468 0.600 25 0
## 469 0.183 38 1
## 470 0.571 27 0
## 471 0.607 28 0
## 472 0.170 22 0
## 473 0.259 22 0
## 474 0.210 50 0
## 475 0.126 24 0
## 476 0.231 59 0
## 477 0.711 29 1
## 478 0.466 31 0
## 479 0.162 39 0
## 480 0.419 63 0
## 481 0.344 35 1
## 482 0.197 29 0
## 483 0.306 28 0
## 484 0.233 23 0
## 485 0.630 31 1
## 486 0.365 24 1
## 487 0.536 21 0
## 488 1.159 58 0
## 489 0.294 28 0
## 490 0.551 67 0
## 491 0.629 24 0
## 492 0.292 42 0
## 493 0.145 33 0
## 494 1.144 45 1
## 495 0.174 22 0
## 496 0.304 66 0
## 497 0.292 30 0
## 498 0.547 25 0
## 499 0.163 55 1
## 500 0.839 39 0
## 501 0.313 21 0
## 502 0.267 28 0
## 503 0.727 41 1
## 504 0.738 41 0
## 505 0.238 40 0
## 506 0.263 38 0
## 507 0.314 35 1
## 508 0.692 21 0
## 509 0.968 21 0
## 510 0.409 64 0
## 511 0.297 46 1
## 512 0.207 21 0
## 513 0.200 58 0
## 514 0.525 22 0
## 515 0.154 24 0
## 516 0.268 28 1
## 517 0.771 53 1
## 518 0.304 51 0
## 519 0.180 41 0
## 520 0.582 60 0
## 521 0.187 25 0
## 522 0.305 26 0
## 523 0.189 26 0
## 524 0.652 45 1
## 525 0.151 24 0
## 526 0.444 21 0
## 527 0.299 21 0
## 528 0.107 24 0
## 529 0.493 22 0
## 530 0.660 31 0
## 531 0.717 22 0
## 532 0.686 24 0
## 533 0.917 29 0
## 534 0.501 31 0
## 535 1.251 24 0
## 536 0.302 23 1
## 537 0.197 46 0
## 538 0.735 67 0
## 539 0.804 23 0
## 540 0.968 32 1
## 541 0.661 43 1
## 542 0.549 27 1
## 543 0.825 56 1
## 544 0.159 25 0
## 545 0.365 29 0
## 546 0.423 37 1
## 547 1.034 53 1
## 548 0.160 28 0
## 549 0.341 50 0
## 550 0.680 37 0
## 551 0.204 21 0
## 552 0.591 25 0
## 553 0.247 66 0
## 554 0.422 23 0
## 555 0.471 28 0
## 556 0.161 37 0
## 557 0.218 30 0
## 558 0.237 58 0
## 559 0.126 42 0
## 560 0.300 35 0
## 561 0.121 54 1
## 562 0.502 28 1
## 563 0.401 24 0
## 564 0.497 32 0
## 565 0.601 27 0
## 566 0.748 22 0
## 567 0.412 21 0
## 568 0.085 46 0
## 569 0.338 37 0
## 570 0.203 33 1
## 571 0.270 39 0
## 572 0.268 21 0
## 573 0.430 22 0
## 574 0.198 22 0
## 575 0.892 23 0
## 576 0.280 25 0
## 577 0.813 35 0
## 578 0.693 21 1
## 579 0.245 36 0
## 580 0.575 62 1
## 581 0.371 21 1
## 582 0.206 27 0
## 583 0.259 62 0
## 584 0.190 42 0
## 585 0.687 52 1
## 586 0.417 22 0
## 587 0.129 41 1
## 588 0.249 29 0
## 589 1.154 52 1
## 590 0.342 25 0
## 591 0.925 45 1
## 592 0.175 24 0
## 593 0.402 44 1
## 594 1.699 25 0
## 595 0.733 34 0
## 596 0.682 22 1
## 597 0.194 46 0
## 598 0.559 21 0
## 599 0.088 38 1
## 600 0.407 26 0
## 601 0.400 24 0
## 602 0.190 28 0
## 603 0.100 30 0
## 604 0.692 54 1
## 605 0.212 36 1
## 606 0.514 21 0
## 607 1.258 22 1
## 608 0.482 25 0
## 609 0.270 27 0
## 610 0.138 23 0
## 611 0.292 24 0
## 612 0.593 36 1
## 613 0.787 40 1
## 614 0.878 26 0
## 615 0.557 50 1
## 616 0.207 27 0
## 617 0.157 30 0
## 618 0.257 23 0
## 619 1.282 50 1
## 620 0.141 24 1
## 621 0.246 28 0
## 622 1.698 28 0
## 623 1.461 45 0
## 624 0.347 21 0
## 625 0.158 21 0
## 626 0.362 29 0
## 627 0.206 21 0
## 628 0.393 21 0
## 629 0.144 45 0
## 630 0.148 21 0
## 631 0.732 34 1
## 632 0.238 24 0
## 633 0.343 23 0
## 634 0.115 22 0
## 635 0.167 31 0
## 636 0.465 38 1
## 637 0.153 48 0
## 638 0.649 23 0
## 639 0.871 32 1
## 640 0.149 28 0
## 641 0.695 27 0
## 642 0.303 24 0
## 643 0.178 50 1
## 644 0.610 31 0
## 645 0.730 27 0
## 646 0.134 30 0
## 647 0.447 33 1
## 648 0.455 22 1
## 649 0.260 42 1
## 650 0.133 23 0
## 651 0.234 23 0
## 652 0.466 27 0
## 653 0.269 28 0
## 654 0.455 27 0
## 655 0.142 22 0
## 656 0.240 25 1
## 657 0.155 22 0
## 658 1.162 41 0
## 659 0.190 51 0
## 660 1.292 27 1
## 661 0.182 54 0
## 662 1.394 22 1
## 663 0.165 43 1
## 664 0.637 40 1
## 665 0.245 40 1
## 666 0.217 24 0
## 667 0.235 70 1
## 668 0.141 40 1
## 669 0.430 43 0
## 670 0.164 45 0
## 671 0.631 49 0
## 672 0.551 21 0
## 673 0.285 47 0
## 674 0.880 22 0
## 675 0.587 68 0
## 676 0.328 31 1
## 677 0.230 53 1
## 678 0.263 25 0
## 679 0.127 25 1
## 680 0.614 23 0
## 681 0.332 22 0
## 682 0.364 26 1
## 683 0.366 22 0
## 684 0.536 27 1
## 685 0.640 69 0
## 686 0.591 25 0
## 687 0.314 22 0
## 688 0.181 29 0
## 689 0.828 23 0
## 690 0.335 46 1
## 691 0.856 34 0
## 692 0.257 44 1
## 693 0.886 23 0
## 694 0.439 43 1
## 695 0.191 25 0
## 696 0.128 43 1
## 697 0.268 31 1
## 698 0.253 22 0
## 699 0.598 28 0
## 700 0.904 26 0
## 701 0.483 26 0
## 702 0.565 49 1
## 703 0.905 52 1
## 704 0.304 41 0
## 705 0.118 27 0
## 706 0.177 28 0
## 707 0.261 30 1
## 708 0.176 22 0
## 709 0.148 45 1
## 710 0.674 23 1
## 711 0.295 24 0
## 712 0.439 40 0
## 713 0.441 38 1
## 714 0.352 21 0
## 715 0.121 32 0
## 716 0.826 34 1
## 717 0.970 31 1
## 718 0.595 56 0
## 719 0.415 24 0
## 720 0.378 52 1
## 721 0.317 34 0
## 722 0.289 21 0
## 723 0.349 42 1
## 724 0.251 42 0
## 725 0.265 45 0
## 726 0.236 38 0
## 727 0.496 25 0
## 728 0.433 22 0
## 729 0.326 22 0
## 730 0.141 22 0
## 731 0.323 34 1
## 732 0.259 22 1
## 733 0.646 24 1
## 734 0.426 22 0
## 735 0.560 53 0
## 736 0.284 28 0
## 737 0.515 21 0
## 738 0.600 42 0
## 739 0.453 21 0
## 740 0.293 42 1
## 741 0.785 48 1
## 742 0.400 26 0
## 743 0.219 22 0
## 744 0.734 45 1
## 745 1.174 39 0
## 746 0.488 46 0
## 747 0.358 27 1
## 748 1.096 32 0
## 749 0.408 36 1
## 750 0.178 50 1
## 751 1.182 22 1
## 752 0.261 28 0
## 753 0.223 25 0
## 754 0.222 26 1
## 755 0.443 45 1
## 756 1.057 37 1
## 757 0.391 39 0
## 758 0.258 52 1
## 759 0.197 26 0
## 760 0.278 66 1
## 761 0.766 22 0
## 762 0.403 43 1
## 763 0.142 33 0
## 764 0.171 63 0
## 765 0.340 27 0
## 766 0.245 30 0
## 767 0.349 47 1
## 768 0.315 23 0
Finally, the dataset was checked again for any remaining missing values. This step confirmed that the data was complete and ready for exploratory analysis and modeling. Any missing values that persisted after imputation would need further investigation or handling, depending on the requirements of the analysis.
# Check for missing values
missing_values <- sapply(clean_data, function(x) sum(is.na(x)))
print(missing_values)
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
## Rows: 768
## Columns: 9
## $ Pregnancies <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure <dbl> 72, 66, 64, 66, 40, 74, 50, 68, 70, 96, 92, 7…
## $ SkinThickness <dbl> 35, 29, 28, 23, 35, 27, 32, 39, 45, 36, 38, 3…
## $ Insulin <dbl> 175, 55, 325, 94, 168, 112, 88, 122, 543, 150…
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome <fct> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
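The re-check described above can also be written as a hard guard that stops the script if any value is still missing. The following is a minimal sketch on a toy data frame. The names `toy`, `impute_median`, and `toy_imputed` are hypothetical, and median imputation is used only as an example strategy; it is an assumption, not necessarily the method applied in this report.

```r
# Toy data frame standing in for the diabetes data (illustrative values only)
toy <- data.frame(
  Glucose = c(148, 85, NA, 89),
  BMI     = c(33.6, 26.6, 23.3, NA)
)

# Example imputation strategy (assumption): replace NA with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
toy_imputed <- as.data.frame(lapply(toy, impute_median))

# Per-column NA counts, using the same sapply idiom as the report
remaining <- sapply(toy_imputed, function(x) sum(is.na(x)))
print(remaining)

# Fail loudly if anything is still missing
stopifnot(all(remaining == 0))
```

Using `stopifnot()` instead of only printing the counts means a later change to the cleaning step cannot silently reintroduce missing values.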
Principal Component Analysis (PCA) was used to reduce the dimensionality of the data while preserving as much of its variance as possible. This section explains how PCA was applied to the dataset and interprets the results.
PCA transformed correlated variables into a set of linearly uncorrelated components, known as principal components (PCs). These components were ordered by the amount of variance they explained in the data, with the first PC explaining the maximum variance and each subsequent PC explaining less.
Initially, the dataset, including dummy variables for the outcome categories, was centered and scaled. This normalization step ensured that each variable contributed equally to the analysis, regardless of its original scale or units.
The PCA results included a summary of the variance explained by each principal component. This information helped in understanding how much information each PC retained from the original dataset. It enabled us to decide how many principal components to retain based on the cumulative variance explained.
# Convert Outcome to dummy variables
clean_data$Outcome_0 <- ifelse(clean_data$Outcome == 0, 1, 0)
clean_data$Outcome_1 <- ifelse(clean_data$Outcome == 1, 1, 0)
# Remove the original Outcome column
clean_data <- clean_data[, !names(clean_data) %in% "Outcome"]
# Perform PCA on your data including dummy variables
pc <- prcomp(clean_data, center = TRUE, scale. = TRUE)
# Summary of the PCA results
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8618 1.2280 1.1425 0.97142 0.95358 0.85786 0.6512
## Proportion of Variance 0.3466 0.1508 0.1305 0.09437 0.09093 0.07359 0.0424
## Cumulative Proportion 0.3466 0.4974 0.6280 0.72233 0.81326 0.88685 0.9293
## PC8 PC9 PC10
## Standard deviation 0.61864 0.56984 1.692e-16
## Proportion of Variance 0.03827 0.03247 0.000e+00
## Cumulative Proportion 0.96753 1.00000 1.000e+00
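The tenth component in the summary above has a standard deviation of about 1.7e-16, which is zero variance for practical purposes. This is the expected result when both Outcome_0 and Outcome_1 enter the PCA, because the two columns always sum to one. The sketch below reproduces the effect on synthetic data and shows that keeping a single dummy column avoids the degenerate component. The data frame `toy` and its columns are made up for illustration and are not taken from the diabetes dataset.

```r
# Illustrative only: synthetic data, not the diabetes dataset
set.seed(1)
n <- 200
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  outcome = rbinom(n, 1, 0.35)
)

# Both dummy columns: outcome_0 + outcome_1 equals 1 in every row
both <- data.frame(
  toy[, c("x1", "x2")],
  outcome_0 = 1 - toy$outcome,
  outcome_1 = toy$outcome
)
pc_both <- prcomp(both, center = TRUE, scale. = TRUE)

# A single dummy column carries the same information
single <- data.frame(toy[, c("x1", "x2")], outcome_1 = toy$outcome)
pc_single <- prcomp(single, center = TRUE, scale. = TRUE)

# The last standard deviation is numerically zero only in the first case
last_sd_both <- tail(pc_both$sdev, 1)
last_sd_single <- tail(pc_single$sdev, 1)
print(c(both = last_sd_both, single = last_sd_single))
```

Whether the outcome should be part of the PCA input at all is a separate modelling decision; the sketch only illustrates the redundancy of the second dummy column.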
# Percentage of variance explained by each principal component
pc_var <- pc$sdev^2 / sum(pc$sdev^2) * 100
# Cumulative variance explained by principal components
pc_cumvar <- cumsum(pc_var)
# Plot the variance explained by each principal component
barplot(pc_var, main = "Variance Explained by Principal Components",
xlab = "Principal Component", ylab = "Percentage of Variance Explained")
# Plot the cumulative variance explained by principal components
plot(pc_cumvar, type = "b", main = "Cumulative Variance Explained by Principal Components",
xlab = "Number of Principal Components", ylab = "Cumulative Percentage of Variance Explained")
### Cumulative Variance Plot Analysis
The x-axis of the plot represents the number of principal components included in the analysis. It starts from 1 (the first principal component) and extends to the total number of features in the dataset.
On the y-axis, you’ll find the cumulative percentage of variance explained by the selected principal components. As you add more principal components, this percentage increases, reflecting the total amount of variance captured by those components. This plot is crucial for deciding the number of components needed to explain a high percentage of the dataset’s variance effectively, typically aiming for 90% or more.
The curve on the plot begins at the bottom left corner at about 35% explained variance, the share captured by PC1 alone, and ascends rapidly at the start. This steep ascent indicates that the initial principal components account for a significant portion of the variance in the data. As more components are added, the curve gradually flattens out.
The cumulative explained variance plot in PCA guides the selection of the optimal number of principal components by showing how much variance is captured as components are added. It balances the trade-off between retaining enough information for analysis while avoiding the inclusion of redundant components that do not significantly contribute to explaining the dataset’s variance.
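The threshold rule described above can be applied directly to the proportions printed by `summary(pc)`. The following is a minimal sketch: the values in `prop_var` are copied from that output, while the 90% `target` is the rule of thumb mentioned in the text rather than a fixed requirement.

```r
# Proportion of variance per component, copied from the summary(pc) output
prop_var <- c(0.3466, 0.1508, 0.1305, 0.09437, 0.09093,
              0.07359, 0.0424, 0.03827, 0.03247, 0)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches the target
target <- 0.90
n_components <- which(cum_var >= target)[1]
print(n_components)
```

With these figures the rule selects seven components. By comparison, the first three components account for about 63% of the variance according to the same output.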
Moreover, Kaiser’s criterion was used to determine the number of principal components to retain. It suggests keeping only those components whose eigenvalues (the variance explained by each component) are greater than one.
# Applying Kaiser's Criterion to PCA Results
# Calculate eigenvalues
eigenvalues <- pc$sdev^2
# Print the eigenvalues
print(eigenvalues)
## [1] 3.466357e+00 1.508074e+00 1.305217e+00 9.436612e-01 9.093126e-01
## [6] 7.359159e-01 4.240280e-01 3.827130e-01 3.247216e-01 2.864546e-32
# Filter eigenvalues greater than 1
optimal_components <- eigenvalues[eigenvalues > 1]
# Print the eigenvalues of principal components greater than 1
print(optimal_components)
## [1] 3.466357 1.508074 1.305217
Variable loadings are the coefficients (eigenvector weights) that define each principal component as a linear combination of the standardised original variables. Their sign and magnitude indicate the direction and strength of each variable’s contribution to a component; multiplying a loading by the component’s standard deviation gives the correlation between the variable and that component. Bar plots of variable loadings visualised which variables were most influential in each principal component.
# Extract variable loadings for PC1 to PC3
variable_loadings <- as.data.frame(pc$rotation[, 1:3])
# Function to create bar plot for variable loadings
create_bar_plot <- function(pc_num) {
bar_data <- data.frame(variable = rownames(variable_loadings),
loading = variable_loadings[, pc_num])
ggplot(bar_data, aes(x = reorder(variable, loading), y = loading)) +
geom_bar(stat = "identity", fill = "#0073C2FF") +
coord_flip() +
labs(title = paste("Variable Loadings for PC", pc_num),
x = "Variable", y = "Loading")
}
# Create bar plots for variable loadings of PC1 to PC3
bar_plot_pc1 <- create_bar_plot(1)
bar_plot_pc2 <- create_bar_plot(2)
bar_plot_pc3 <- create_bar_plot(3)
# Display the plots
bar_plot_pc1
In PCA, variable loadings quantify the contribution of each original variable to the variance explained by each principal component. These loadings are crucial for understanding which variables are most influential in defining each PC.
Summary: The sign and magnitude of loadings indicate the direction and strength of the relationship between each original variable and PC1. Higher absolute loading values signify greater influence on the variance explained by PC1. Positive loadings indicate a direct correlation, while negative loadings signify an inverse correlation with PC1.
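The relationship between the weights returned by `prcomp()` and the correlations between variables and components can be checked numerically. The sketch below uses the built-in `iris` measurements purely as a stand-in (an assumption for illustration, not the diabetes data); `pc_demo` and the other object names are hypothetical.

```r
# Illustrative only: built-in iris measurements, not the diabetes data
x <- iris[, 1:4]
pc_demo <- prcomp(x, center = TRUE, scale. = TRUE)

# Eigenvector coefficients as returned by prcomp()
weights <- pc_demo$rotation

# Correlation of each original variable with each component score
cor_direct <- cor(x, pc_demo$x)

# With standardised inputs, scaling each column of weights by the
# component's standard deviation gives the same correlations
cor_from_weights <- sweep(weights, 2, pc_demo$sdev, `*`)

print(all.equal(cor_direct, cor_from_weights, check.attributes = FALSE))
```

The two matrices agree, so a weight and a correlation always share the same sign, but their magnitudes differ by the component's standard deviation.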
Visualising PCA results was crucial for interpreting the relationships between data points and understanding the distribution of variables across principal components. Scatter plots and variable representation plots (showing arrows indicating variable contributions) provided intuitive insights into how variables clustered and correlated in reduced-dimensional space.
# Extract PCA components for clustering
pca_data <- pc$x
# Visualize PCA results (scatter plot)
pca_scatter <- fviz_pca_ind(pc, geom.ind = "point",
pointshape = 21, palette = "jco",
addEllipses = TRUE, ellipse.level = 0.95,
repel = TRUE) +
ggtitle("PCA Visualisation")
# Variable representation (arrows)
var_representation <- fviz_pca_var(pc, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, axes = c(1, 2), arrows = TRUE) +
labs(title = "Variable Representation")
# Combine plots
combined_plot <- grid.arrange(pca_scatter, var_representation, ncol = 2)
To determine the number of clusters, statistical techniques such as the elbow method, the silhouette method, and the gap statistic were employed. These methods helped in selecting a number of clusters that captured the variability in the data while avoiding overfitting. Four clusters were chosen based on these methods, together with the average silhouette width computed after clustering.
The Elbow method is a technique used in clustering algorithms, particularly K-means clustering, to determine the optimal number of clusters \(k\) in a dataset. It involves plotting the within-cluster sum of squares (WSS) against different values of \(k\). The “elbow” point on the plot represents the optimal \(k\) where the rate of decrease in WSS slows down, indicating that adding more clusters does not significantly improve the clustering performance.
# Elbow Method
elbow_method <- function(pca_data, max_k) {
wss <- numeric(max_k)
for (i in 1:max_k) {
kmeans_model <- kmeans(pca_data, centers = i, nstart = 10)
wss[i] <- sum(kmeans_model$tot.withinss)
}
plot(1:max_k, wss, type = "b", xlab = "Number of Clusters (k)", ylab = "Total Within Sum of Squares (WSS)", main = "Elbow Method")
}
# Call elbow_method function
elbow_method(pca_data, max_k = 10)
The Silhouette Method is a technique used to determine the optimal number of clusters \(k\) in a dataset for clustering algorithms like K-means. It evaluates how similar each point in one cluster is to points in its own cluster compared to points in other clusters.
# Silhouette Method
silhouette_method <- function(pca_data, max_k) {
silhouette_scores <- numeric(max_k)
for (i in 2:max_k) {
kmeans_model <- kmeans(pca_data, centers = i, nstart = 10)
silhouette_obj <- silhouette(kmeans_model$cluster, dist(pca_data))
silhouette_scores[i] <- mean(silhouette_obj[, "sil_width"])
}
plot(2:max_k, silhouette_scores[2:max_k], type = "b", xlab = "Number of Clusters (k)", ylab = "Silhouette Score", main = "Silhouette Method")
}
# Call silhouette_method function
silhouette_method(pca_data, max_k = 10)
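To make the silhouette score concrete, the sketch below computes the width of each observation by hand as (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster, and compares the result with `cluster::silhouette()`. The data are synthetic and `silhouette_width` is a hypothetical helper written for this illustration.

```r
library(cluster)  # silhouette()

# Illustrative only: two synthetic groups
set.seed(7)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)
d <- as.matrix(dist(toy))

# Silhouette width of observation i: (b - a) / max(a, b)
silhouette_width <- function(i, d, cluster) {
  own <- cluster == cluster[i]
  own[i] <- FALSE                       # exclude the point itself
  a <- mean(d[i, own])                  # mean distance within its own cluster
  others <- setdiff(unique(cluster), cluster[i])
  b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
  (b - a) / max(a, b)
}

by_hand <- sapply(seq_len(nrow(toy)), silhouette_width,
                  d = d, cluster = km$cluster)
reference <- as.numeric(silhouette(km$cluster, dist(toy))[, "sil_width"])
print(all.equal(by_hand, reference))
```

The helper does not handle clusters that contain a single observation, which `silhouette()` scores as zero; that case does not arise in this toy example.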
### Gap Statistics
The gap statistic is a method used to determine the optimal number of clusters \(k\) in a dataset for clustering algorithms such as K-means. It compares the within-cluster variation (sum of squares) of the clustering algorithm’s output with that expected under a reference null distribution that represents data with no obvious clustering structure.
gap_statistics <- function(pca_data, max_k, B = 10) {
  gap <- numeric(max_k - 1) # One gap value for each k from 2 to max_k
  for (i in 2:max_k) {
    print(paste("Calculating gap statistic for k =", i))
    # clusGap() evaluates every k from 1 to K.max, so row i of Tab holds the gap for k = i
    gap_result <- clusGap(pca_data, FUNcluster = kmeans, K.max = i, B = B)
    gap[i - 1] <- gap_result$Tab[i, "gap"] # Gap value at k = i, not the running maximum
  }
  plot(2:max_k, gap, type = "b", xlab = "Number of Clusters (k)", ylab = "Gap Statistic", main = "Gap Statistics")
}
# Call gap_statistics function
gap_statistics(pca_data, max_k = 10)
## [1] "Calculating gap statistic for k = 2"
## [1] "Calculating gap statistic for k = 3"
## [1] "Calculating gap statistic for k = 4"
## [1] "Calculating gap statistic for k = 5"
## [1] "Calculating gap statistic for k = 6"
## [1] "Calculating gap statistic for k = 7"
## [1] "Calculating gap statistic for k = 8"
## [1] "Calculating gap statistic for k = 9"
## [1] "Calculating gap statistic for k = 10"
## Warning: did not converge in 10 iterations
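As a point of comparison, `clusGap()` evaluates every k from 1 to `K.max` in a single call and returns one gap value per k, and `maxSE()` can then select k using a standard-error rule. The sketch below shows this on synthetic data, assuming the cluster package is installed. The object `toy`, the values `B = 20` and `iter.max = 50`, and the choice of the `"Tibs2001SEmax"` rule are illustrative assumptions, not settings taken from this analysis.

```r
library(cluster)  # clusGap(), maxSE()

# Illustrative only: two synthetic groups
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))

# One call covers k = 1..K.max; nstart and iter.max are passed on to kmeans()
gap_result <- clusGap(toy, FUNcluster = kmeans, K.max = 5, B = 20,
                      verbose = FALSE, nstart = 10, iter.max = 50)
gap_table <- gap_result$Tab

# Choose k with the standard-error rule of Tibshirani et al. (2001)
best_k <- maxSE(gap_table[, "gap"], gap_table[, "SE.sim"],
                method = "Tibs2001SEmax")
print(gap_table)
print(best_k)
```

Raising `iter.max` is one way to address convergence warnings such as the one shown above, since `kmeans()` stops after 10 iterations by default.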
K-means clustering is a method used to partition a dataset into distinct groups (clusters) based on similarity. By minimising the variance within each cluster, K-means aims to create groups where the data points within each group are more similar to each other than to those in other groups. This analysis applied K-means clustering to the results of Principal Component Analysis (PCA) to understand the underlying patterns in the data.
To explore the dataset’s structure, we performed K-means clustering using the first three principal components (PC1 to PC3). We set the number of clusters (K) to 4. The PCA results provided a reduced dimensionality space, making it easier to visualise and interpret the clusters.
## Cluster Visualization
The clusters were visualised in two-dimensional and three-dimensional plots. The 2D plot displayed the data points along PC1 and PC2, coloured according to their cluster assignments. A 3D scatter plot further illustrated the clustering in the PCA-reduced space, showing the separation and distribution of clusters across PC1, PC2, and PC3.
# Extract PC scores for PC1 to PC3
pc_scores <- as.data.frame(pc$x[, 1:3])
# Perform K-means clustering with K = 4 using PCA results
kmeans_result <- kmeans(pc_scores, centers = 4, nstart = 10)
# Visualize the clusters
ggplot(pc_scores, aes(x = PC1, y = PC2, color = factor(kmeans_result$cluster))) +
geom_point() +
scale_color_discrete(name = "Cluster") +
labs(title = "K-means Clustering Results (K = 4)", x = "Principal Component 1", y = "Principal Component 2")
## Observations from Visualisation (scatter plot)
- Cluster Distribution: The visualisations revealed distinct groupings, indicating that the K-means algorithm identified separate clusters within the data.
- Cluster Separation: Some clusters exhibited overlapping boundaries, suggesting that observations near those boundaries are similar across neighbouring groups.
# Extract PC scores for PC1 to PC3
pc_scores <- as.data.frame(pc$x[, 1:3])
# Perform K-means clustering with K = 4 using PCA results
set.seed(123) # Setting seed for reproducibility
kmeans_result <- kmeans(pc_scores, centers = 4, nstart = 10)
# Visualize the clusters using fviz
fviz_cluster(kmeans_result, data = pc_scores, geom = "point",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(),
main = "K-means Clustering Results (K = 4)")
## Observations from Visualisation (fviz_cluster)
- Cluster Sizes: The clusters varied in size, reflecting the dataset’s inherent structure and the algorithm’s ability to group similar observations.
- Dissimilarity: Clusters showed varying degrees of internal dissimilarity, with some clusters being more homogeneous than others.
3D Visualization of clusters
# Visualize the clusters in 3D
plot_ly(data = pc_scores, x = ~PC1, y = ~PC2, z = ~PC3, color = factor(kmeans_result$cluster),
type = "scatter3d", mode = "markers", marker = list(size = 6)) %>%
layout(title = "K-means Clustering Results (K = 4)",
scene = list(xaxis = list(title = "Principal Component 1"),
yaxis = list(title = "Principal Component 2"),
zaxis = list(title = "Principal Component 3")))
Cluster Statistics: To gain deeper insights into the clusters, we computed various statistics:
- Number of Observations: Each cluster’s size was calculated to understand the distribution of data points across clusters.
- Dissimilarity Measures: Maximum and average dissimilarities were evaluated with daisy() from the cluster package. Because the PCA scores are all numeric, daisy() uses Euclidean distance here; Gower distance is applied only when the data contain mixed types.
- Isolation: The isolation of each cluster was assessed by measuring the minimum distance between cluster centers, indicating how distinct each cluster is from others.
# Perform K-means clustering with K = 4 using PCA results
set.seed(123) # For reproducibility
kmeans_result <- kmeans(pc_scores[, 1:3], centers = 4, nstart = 10)
# Add cluster labels to the PCA scores
pc_scores_with_clusters <- cbind(pc_scores, cluster = kmeans_result$cluster)
# Function to compute cluster statistics
compute_cluster_stats <- function(cluster_data, cluster_centers) {
cluster_stats <- cluster_data %>%
group_by(cluster) %>%
summarise(
number_obs = n(),
max_dissimilarity = max(daisy(cluster_data[, 1:3])),
average_dissimilarity = mean(daisy(cluster_data[, 1:3]))
)
isolation <- sapply(1:nrow(cluster_centers), function(i) {
min_dist <- min(dist(rbind(cluster_centers[i, ], cluster_centers[-i, ])))
return(min_dist)
})
cluster_stats$isolation <- isolation
return(cluster_stats)
}
# Compute dissimilarities and cluster statistics
dissimilarities <- daisy(pc_scores_with_clusters[, 1:3])
cluster_centers <- kmeans_result$centers
cluster_stats <- compute_cluster_stats(pc_scores_with_clusters, cluster_centers)
# Print the cluster statistics
print(cluster_stats)
## # A tibble: 4 × 5
## cluster number_obs max_dissimilarity average_dissimilarity isolation
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 219 10.5 3.23 2.40
## 2 2 266 10.5 3.23 2.40
## 3 3 127 10.5 3.23 2.40
## 4 4 156 10.5 3.23 2.40
Cluster Sizes: The number of observations in each cluster varied, with Cluster 2 being the largest (266 observations) and Cluster 3 being the smallest (127 observations). This variation in cluster sizes indicates a non-uniform distribution of data points across the clusters.
Maximum Dissimilarity: All clusters show the same maximum dissimilarity of 10.51. The value is identical because compute_cluster_stats() passes the full data frame to daisy() inside summarise() rather than the rows of the current group, so it is the largest pairwise distance in the whole dataset and not a per-cluster measure of internal variability.
Average Dissimilarity: For the same reason, the reported average of 3.23 is the mean pairwise distance across all 768 observations. It should not be read as evidence that the clusters are equally cohesive.
Isolation: The isolation value of 2.40 is also identical for every cluster, because the calculation takes the minimum over all pairwise distances between the four centers instead of the distances from one center to the others. It is therefore the distance between the two closest cluster centers, not a per-cluster measure of separation.
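A per-cluster version of these statistics would subset the rows of each cluster before computing distances, and would compare each center only with the other centers. The following is a minimal sketch in base R on synthetic data. `per_cluster_stats` and `toy_scores` are hypothetical names, Euclidean distance is assumed, and the numbers it prints are unrelated to the diabetes results.

```r
# Illustrative only: synthetic scores standing in for PC1 to PC3
set.seed(1)
toy_scores <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
                    matrix(rnorm(60, mean = 5), ncol = 3),
                    matrix(rnorm(60, mean = 10), ncol = 3))
colnames(toy_scores) <- c("PC1", "PC2", "PC3")
km <- kmeans(toy_scores, centers = 3, nstart = 10)

per_cluster_stats <- function(scores, cluster, centers) {
  center_dist <- as.matrix(dist(centers))
  diag(center_dist) <- NA  # ignore the zero distance of a center to itself
  rows <- lapply(sort(unique(cluster)), function(k) {
    # Pairwise distances among the members of cluster k only
    d <- as.numeric(dist(scores[cluster == k, , drop = FALSE]))
    data.frame(cluster = k,
               number_obs = sum(cluster == k),
               max_dissimilarity = max(d),
               average_dissimilarity = mean(d),
               # Distance from this center to its nearest neighbouring center
               isolation = min(center_dist[k, ], na.rm = TRUE))
  })
  do.call(rbind, rows)
}

toy_stats <- per_cluster_stats(toy_scores, km$cluster, km$centers)
print(toy_stats)
```

For the analysis above, the equivalent call would presumably pass `pc_scores`, `kmeans_result$cluster`, and `kmeans_result$centers`; the resulting values are not reproduced here because they were not part of the original output.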
Evaluating the performance and validity of the K-means clustering algorithm is essential to ensure that the clusters formed are meaningful and distinct. This section presents the evaluation of the K-means clustering performed on the dataset using several key metrics.
Calinski-Harabasz Index: The Calinski-Harabasz index, also known as the Variance Ratio Criterion, measures the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion. Higher values of the Calinski-Harabasz index indicate better-defined and more distinct clusters. For the given clustering solution, the Calinski-Harabasz index was computed, providing a quantitative assessment of cluster separation and compactness.
Dunn Index: The Dunn index is another metric used to evaluate the clustering quality by considering both the minimum inter-cluster distance and the maximum intra-cluster distance. A higher Dunn index indicates better clustering, as it signifies well-separated and compact clusters. The Dunn index was calculated for the clustering solution, helping to confirm the distinctiveness of the clusters.
Silhouette Coefficient: The silhouette coefficient measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, where a value close to 1 indicates that the data point is well-matched to its own cluster and poorly matched to neighboring clusters. The silhouette coefficients for each data point were calculated, and a silhouette plot was generated to visually assess the clustering quality. The average silhouette widths per cluster ranged from 0.14 to 0.27, which points to weak separation and considerable overlap between clusters.
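The Calinski-Harabasz index described above can be computed directly from the sums of squares that `kmeans()` already returns. The sketch below uses synthetic data; `toy` and `ch_index` are hypothetical names, and the formula shown is the standard variance-ratio definition rather than anything specific to this report.

```r
# Illustrative only: synthetic data with three groups
set.seed(3)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))
km <- kmeans(toy, centers = 3, nstart = 10)

n <- nrow(toy)          # number of observations
k <- nrow(km$centers)   # number of clusters

# CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
print(ch_index)
```

Because the index is a ratio, it has no fixed scale; it is most useful for comparing candidate values of k on the same data.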
# Compute the distance matrix on the PCA scores
dist_matrix <- dist(pca_data)
# Cluster assignments from K-means
clusters <- kmeans_result$cluster
# Compute the validation statistics once and reuse them
validation_stats <- fpc::cluster.stats(dist_matrix, clusters)
# Calinski-Harabasz index
calinski_harabasz <- validation_stats$ch
# Dunn index
dunn_index <- validation_stats$dunn
# Print evaluation metrics
print(paste("Calinski-Harabasz Index:", calinski_harabasz))
print(paste("Dunn Index:", dunn_index))
## [1] "Calinski-Harabasz Index: 180.344150775887"
## [1] "Dunn Index: 0.0408701186773627"
# Function to calculate silhouette coefficients
library(cluster)
calculate_silhouette <- function(data, clusters) {
  # Use the data passed in rather than the global pca_data
  sil <- silhouette(clusters, dist(data))
  return(sil)
}
# Calculate silhouette coefficients
sil_scores <- calculate_silhouette(pca_data, clusters)
# Plot silhouette plot
fviz_silhouette(sil_scores, palette = "jco", main = "Silhouette Plot for K-means Clustering")
## cluster size ave.sil.width
## 1 1 219 0.27
## 2 2 266 0.16
## 3 3 127 0.14
## 4 4 156 0.16
To further understand the characteristics of each cluster, the mean values of the original variables were computed for each cluster. This analysis provided insights into the distinguishing features of each cluster, revealing how different variables contributed to the clustering results. A bar plot was created to visualise the cluster profiles, showing the mean values of the variables for each cluster and highlighting the differences between them.
# Perform K-means clustering with K = 4
set.seed(123) # Set seed for reproducibility
kmeans_result <- kmeans(pca_data, centers = 4, nstart = 25)
clusters <- kmeans_result$cluster
# Assign cluster labels to clean_data
clean_data_with_clusters <- cbind(clean_data, Cluster = kmeans_result$cluster)
# Calculate mean of original variables by cluster
cluster_means <- aggregate(. ~ Cluster, data = clean_data_with_clusters, FUN = mean)
# Reshape data for plotting (assuming clean_data has appropriate column names)
cluster_means_long <- pivot_longer(cluster_means, cols = -Cluster, names_to = "Variable", values_to = "Mean")
# Create bar plot
ggplot(cluster_means_long, aes(x = Variable, y = Mean, fill = factor(Cluster))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Cluster Profiles (Clean Data)",
x = "Variable",
y = "Mean",
fill = "Cluster") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Plotly Bar Chart
Cluster 1
- Age: The average age in Cluster 1 is 26.42 years.
- Outcome: Only 9.9% of individuals in this cluster have diabetes (Outcome_1 = 0.099), while 90% do not (Outcome_0 = 0.90).
- Glucose: The mean glucose level is 105.89 mg/dL, which is relatively low compared to the other clusters.
- Insulin: The average insulin level is 96.2 µU/mL, which is relatively normal.
- BMI: The mean BMI is 26.75, indicating that individuals in this cluster are slightly overweight according to the WHO classification.
- Blood Pressure: The average blood pressure is 65.09 mmHg, which is the lowest among all clusters.
- Diabetes Pedigree Function: The mean value is 0.428, suggesting a mild genetic predisposition to diabetes.
- Pregnancies: On average, individuals have 2.29 pregnancies.
- Skin Thickness: The mean skin thickness is 21.079 mm, indicating thinner skinfold measurements compared to other clusters.
Cluster 2
- Age: The average age in Cluster 2 is 35.05 years.
- Outcome: A significant 76.5% of individuals in this cluster have diabetes (Outcome_1 = 0.765), while only 23.4% do not (Outcome_0 = 0.234).
- Glucose: The mean glucose level is 164.67 mg/dL, the highest among all clusters.
- Insulin: The average insulin level is 295 µU/mL, which may reflect insulin resistance or insulin therapy.
- BMI: The mean BMI is 37, categorizing individuals in this cluster as obese.
- Blood Pressure: The average blood pressure is 73.5 mmHg.
- Diabetes Pedigree Function: The mean value is 0.693, indicating a higher genetic predisposition to diabetes.
- Pregnancies: On average, individuals have 3.86 pregnancies.
- Skin Thickness: The mean skin thickness is 34.4 mm, indicating thicker skinfold measurements.
Cluster 3
- Age: The average age in Cluster 3 is 28 years.
- Outcome: 33.1% of individuals in this cluster have diabetes (Outcome_1 = 0.331), while 69% do not (Outcome_0 = 0.69).
- Glucose: The mean glucose level is 113 mg/dL.
- Insulin: The average insulin level is 128 µU/mL.
- BMI: The mean BMI is 37.46, indicating that individuals in this cluster are obese (class II under the WHO classification, which reserves class III for a BMI of 40 or more).
- Blood Pressure: The average blood pressure is 74.1 mmHg.
- Diabetes Pedigree Function: The mean value is 0.45, indicating a moderate genetic predisposition to diabetes.
- Pregnancies: On average, individuals have 2.06 pregnancies.
- Skin Thickness: The mean skin thickness is 35.4 mm, the highest among all clusters.
Cluster 4
- Age: The average age in Cluster 4 is 47 years.
- Outcome: 44.9% of individuals in this cluster have diabetes (Outcome_1 = 0.449), while 55% do not (Outcome_0 = 0.55).
- Glucose: The mean glucose level is 127 mg/dL.
- Insulin: The average insulin level is 141 µU/mL.
- BMI: The mean BMI is 32, categorising individuals in this cluster as obese.
- Blood Pressure: The average blood pressure is 80 mmHg, the highest among all clusters.
- Diabetes Pedigree Function: The mean value is 0.42, indicating a moderate genetic predisposition to diabetes.
- Pregnancies: On average, individuals have 8 pregnancies, the highest among all clusters.
- Skin Thickness: The mean skin thickness is 30.42 mm.
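The BMI labels used in the profiles follow the WHO adult categories, which can be applied consistently with a small helper. The sketch below is illustrative: `bmi_category` is a hypothetical function, and the input values are the cluster mean BMIs quoted above.

```r
# WHO adult BMI categories (kg/m^2); each interval includes its lower bound
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("underweight", "normal weight", "overweight",
                 "obesity class I", "obesity class II", "obesity class III"),
      right = FALSE)
}

# Mean BMI per cluster as reported in the profiles
cluster_bmi <- c(cluster_1 = 26.75, cluster_2 = 37,
                 cluster_3 = 37.46, cluster_4 = 32)
print(data.frame(mean_bmi = cluster_bmi,
                 category = bmi_category(cluster_bmi)))
```

A category assigned to a cluster mean describes the average member only; individuals within a cluster can fall into different categories.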
Cluster 1: Low Risk Young Adults
Intervention Focus: Prevention and Education
For Cluster 1, comprising young adults with a low risk of diabetes, strategic interventions should focus on prevention and education. Promoting a healthy lifestyle through continued encouragement of balanced eating habits and regular physical activity is essential to keeping their glucose levels low and moving their slightly elevated BMI towards the normal range. Educational campaigns on diabetes prevention, specifically targeting young adults, can reinforce the importance of these habits. Additionally, advocating for routine health screenings to monitor vital signs such as glucose and insulin levels can help in early detection and prevention of diabetes.
Cluster 2: High Risk Middle-Aged Adults
Intervention Focus: Intensive Management and Support
Cluster 2 consists of middle-aged adults at high risk for diabetes, necessitating intensive management and support. Medical management, which may include glucose-lowering medication or insulin therapy as clinically indicated, is crucial to control high glucose levels. Specialized weight management programs should be implemented to address obesity and reduce related health risks. Given the high genetic predisposition to diabetes in this cluster, genetic counseling can provide valuable insights and management strategies. Regular health check-ups and increased monitoring frequency are imperative for early intervention and effective management of potential complications.
Cluster 3: Moderate Risk Young Adults
Intervention Focus: Risk Reduction and Monitoring
Young adults in Cluster 3, who face a moderate risk of developing diabetes, require interventions aimed at risk reduction and monitoring. Targeted diet and exercise programs can help address obesity and manage glucose levels. Regular health monitoring of glucose and insulin levels is essential to manage and reduce the risk of diabetes. Establishing support groups can provide necessary lifestyle modification guidance and peer support to encourage healthy habits. Preventive healthcare services, including routine screenings and early detection strategies, should be emphasized to mitigate the risk of diabetes.
Cluster 4: Moderate Risk Older Adults
Intervention Focus: Comprehensive Care and Lifestyle Adjustments
For older adults in Cluster 4 with a moderate risk of diabetes, a comprehensive care approach combining medical treatment and lifestyle adjustments is recommended. Chronic disease management programs tailored to older adults should focus on controlling BMI, blood pressure, and glucose levels. Encouraging participation in age-appropriate physical activity programs can improve overall health. These integrated care strategies aim to manage moderate risk factors effectively and improve health outcomes.
This report analysed the characteristics of four distinct clusters within a dataset, each representing different levels of diabetes risk. Cluster 1, comprising young adults with low diabetes risk, benefits from preventive education and lifestyle maintenance. Cluster 2, with high-risk middle-aged adults, requires intensive management and support, including medical treatments and weight management. Cluster 3, featuring young adults with moderate risk, should focus on risk reduction and continuous monitoring through targeted diet and exercise programs. Cluster 4, consisting of older adults with moderate risk, demands comprehensive care and lifestyle adjustments tailored to their specific health needs.
By identifying and understanding these clusters, we can implement strategic interventions that are tailored to the unique needs of each group. This targeted approach enhances the effectiveness of diabetes prevention and management efforts, ultimately leading to improved health outcomes and a reduction in the prevalence of diabetes.